>> Mike Seltzer: So good morning everybody. It's my great pleasure to welcome John McDonough and his colleagues to Microsoft Research today. This is the first of four exciting signal processing-related talks we have after the [inaudible] 2008 Seattle conference.

I guess I first met John visiting Karlsruhe in 2002 or 2003, sharing a common interest in microphone arrays and speech recognition. And he's done a lot of great work related to the CHIL project with large-vocabulary speech recognition in difficult environments like parliamentary proceedings and things like that, and meetings and other scenarios.

And he moved from Karlsruhe to University of Saarland in 2006 -- is that right?

>> John McDonough: Seven.

>> Mike Seltzer: -- where he is currently continuing this work in distant-talking speech recognition. He also has a book on the way on distant-talking speech recognition, which we look forward to. And as a side note, he participated in the Speech Separation Challenge, which was one of these evaluations for source separation, and I believe he got -- is it the best number or -- so we look forward to hearing all about that in his talk. So please welcome John.

>> John McDonough: Okay.

[applause]

>> John McDonough: All right. So thanks, everyone, for getting up so early. Thanks, Mike, for inviting me here after I told you I was coming.

Okay. So this is sort of an overview talk. I wanted to talk about a few of the different technologies that we use for speech separation and, in general, distant speech recognition. So this talk will not concentrate on any one single technology.

We won't concentrate on beamforming, we won't concentrate on search, we won't concentrate on speaker adaptation.

But what I hope to address here is sort of how all of these things are related and how in order to do good distant speech recognition you have to get all of the little things right.

You have to do everything right, not just the thing you like doing.

Okay. So we're going to talk a little bit about beamforming and how it relates to blind source separation; a little bit about speech enhancement; a little bit about single versus multichannel signal processing; a little bit about, well, how beamforming is different from speech enhancement, why it's not the same thing; and a little bit about how what are currently sort of separate research communities can influence each other and help each other.

Okay. So speech separation. A few years ago there was a paper by Parra and Alvino, and they spoke about how they can use decorrelation to separate speech essentially. And they actually looked at two beamformers and they imposed a geometric constraint on the beamformers, and through this geometric constraint they minimized not the variance of the output of the beamformer; rather, they decorrelated the signals.

Okay.

So we actually learned about this work after we had written a paper about minimum mutual information beamforming. And it was kind of interesting because after we had gone through all the math, we realized that minimum mutual information beamforming is equivalent or nearly equivalent to this approach by Parra and Alvino, if you make a Gaussian assumption. But if you make a non-Gaussian assumption -- and speech, by the way, is non-Gaussian, so I'll mention this early, because I intend to mention it often: speech is very non-Gaussian -- it's something quite different.

So we've all seen what speech looks like. Okay. Speech is very non-Gaussian and it's also very nonstationary. So this utterance is actually -- is Victor [inaudible] in the audience? This utterance is "distant speech recognition" spoken by my colleague. Okay.

So you can see that during the periods of vowels you have this highly harmonic structure, you have a lot of energy. In between the vowels you have much less energy and it's not periodic at all. Okay.

As Mike mentioned during his talk a couple of days ago, it's unfortunately the case that when people devise beamforming algorithms they don't take into account the nature of speech. Okay. We think that's a mistake. We think you should -- there's only one important signal, right? We don't care about beamforming for underwater acoustics or sonar or something. Speech is the important signal, right? Let's take into account what speech looks like.

So in addition to being very nonstationary, speech is also highly sparse in the subband domain. You can see that here, right? So especially during the fricatives, the consonants, you can see that speech is very sparse. But it's also sparse during the periods of voiced speech. Why? Because during the periods of voiced speech you have a harmonic structure. Okay. So if a subband lies on a harmonic, it's going to be strongly excited. If it lies between the harmonics, it will be essentially not excited.

Okay. Already said this. All right. So what we've looked at recently is trying to model the characteristics of speech. And one of the first characteristics we looked at was the non-Gaussianity of speech. Speech is not a Gaussian signal. So we wrote a paper a few years ago, it's recently appeared, and we looked at modeling speech with these so-called Meijer G-functions. Okay. Meijer G-functions are quite nice because they allow you to extend the univariate PDF to higher-order variates. All right.

So it turns out, though, that you can do probably even better speech modeling if you use something called the generalized Gaussian PDF. So the generalized Gaussian looks like this. And so here P is the shape factor. If you set P equal to 2, you have the regular Gaussian. Okay. If not, you have this normalization factor here that ensures that this sigma term is in fact the square root of the variance; i.e., the standard deviation of your PDF, okay? You can solve for this normalization factor as such.
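For reference, one common parameterization of the generalized Gaussian is written out below; the exact normalization used on the slide may differ, so treat this as a sketch of the form rather than the precise expression from the paper:

```latex
% Generalized Gaussian pdf with shape factor p and standard deviation sigma
% (one common parameterization; the slide's normalization may differ).
p_{\mathrm{gg}}(y) \;=\; \frac{p}{2\,\hat{\sigma}\,\Gamma(1/p)}
  \exp\!\left\{-\left|\frac{y}{\hat{\sigma}}\right|^{p}\right\},
\qquad
\hat{\sigma} \;=\; \sigma\,\sqrt{\frac{\Gamma(1/p)}{\Gamma(3/p)}} .
% p = 2 recovers the Gaussian, p = 1 the Laplacian, and p < 1 gives
% increasingly super-Gaussian (heavy-tailed, high-kurtosis) densities.
```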

So what's interesting about this is that by setting P to be less than 1, or less than 2, you can get a very super-Gaussian structure -- or PDF structure. So here is what the PDF looks like for various different values of P. So you can see that for P equal to 2, you get the Gaussian. For P equal to 1, you get the Laplacian. Okay. But where things become interesting is when you set P less than 1, because in this case you get a very super-Gaussian structure. Okay. So your kurtosis increases every time you lower P.

Okay.

So you can see that for a very low value of P, you get a PDF with a very high kurtosis, which means you have lots of probability mass in the center, not so much in the intermediate regions, but then lots of probability mass out in the tails as well. And this is what speech looks like.

So in this plot what we did was we actually estimated the optimal value of P, the shape factor, on a set of training data, along with the scale factor, and we've plotted these optimal values globally for different frequency bands. So what you can see here from the scale factor is that, as we all know, the energy in speech is mostly concentrated in the lower frequency bands. What's interesting, though, is that speech is fairly non-Gaussian in the lower bands and becomes closer to Gaussian in the higher frequency bands.

So here is a histogram that demonstrates quite clearly that speech is not Gaussian. So you can see here in the green is the actual speech, okay; this is what we modeled from the training data. And you can see that speech is not at all well represented by the Gaussian distribution. The Gaussian distribution is this one. It's represented better by the Laplacian, better still by the gamma distribution, but best of all, so far, by this super-Gaussian density, this generalized Gaussian density.

Okay. So it's kind of interesting that speech is not a Gaussian signal -- or, actually, it sort of stands to reason. If you know anything about the independent component analysis literature, then they claim that every signal that's interesting -- i.e., every signal that carries information -- is not Gaussian. Okay. Why is it not Gaussian? Well, this might not be exactly their explanation, but this is sort of the way that I understand it.

First of all, the central limit theorem says that if you add enough independent random variables together, regardless of their distribution, the sum will be Gaussian, okay? But in modeling speech, we don't care about the sum of independent random variables; rather, we care about one single random variable, namely the speech. Speech is highly redundant, and that means that it will not be Gaussian.

You can think of a Gaussian random variable as somehow being, well, either the most informative or the least predictable of all possible random variables. It's the most informative, and the least predictable, because it has the highest entropy of all random variables for a given variance. So real signals -- i.e., signals that carry information -- are redundant, and this redundancy makes them predictable. Okay. We don't need to convince anyone here that speech is predictable. Okay. It can be coded. Its bit rate can be reduced. Right? So speech is not Gaussian, as I've said. Not Gaussian.

Okay. This is -- okay. So I'm sure we're all familiar with this little presentation here where if we take enough random variables and add them all together, the PDF becomes Gaussian. So you've all seen this before, I'm sure. So here's a little MATLAB script I put together. You can see that the very pointy PDF is actually Laplacian. And if you add together enough of those Laplacian random variables, you eventually approach a Gaussian. So you can see that the very thick blue line is actually the Gaussian random variable. Okay. So here's the linear domain, here's the log domain. So you can see in the log domain the tails are eventually approaching the Gaussian. All right. And so it sort of stands to reason that speech is not Gaussian.
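A rough Python analogue of the MATLAB script being described; the number of summed variables, the sample count, and the plotting details are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
num_samples = 200_000

plt.figure()
for num_terms in (1, 2, 4, 16):
    # Sum of unit-variance Laplacian variables, rescaled back to unit variance.
    x = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(num_samples, num_terms))
    s = x.sum(axis=1) / np.sqrt(num_terms)
    hist, edges = np.histogram(s, bins=200, range=(-5, 5), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    plt.semilogy(centers, hist, label=f"sum of {num_terms} Laplacians")

# Reference Gaussian with the same (unit) variance; in the log domain the
# tails of the sums approach this curve as more terms are added.
grid = np.linspace(-5, 5, 400)
plt.semilogy(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi), "k", lw=3,
             label="Gaussian")
plt.legend()
plt.title("Sums of Laplacian variables approach the Gaussian (log domain)")
plt.show()
```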

Here again are more histograms of speech. And what you can see here is that when speech is corrupted by reverberation or by noise, it becomes more Gaussian. Okay. The central limit theorem says it should become more Gaussian.

So what we've looked at in the past few months or years is actually how we can make use of this fact that speech is not Gaussian and that it becomes more Gaussian when it's corrupted by noise or by reverberation in order to do better beamforming. So we started with a generalized sidelobe canceler, and we've all seen this before. Here's the input to our beamformer. Here's the quiescent weight vector, the blocking matrix, and our active weight vector. And then in addition we might have a post-filter, okay, on the output of the beamformer.
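In one subband, the generalized sidelobe canceller output can be sketched as follows; the array size, the broadside steering, and the weight values here are placeholder assumptions, not the configuration used in the talk:

```python
import numpy as np

def gsc_output(X, w_q, B, w_a):
    """Generalized sidelobe canceller for one subband.

    X   : (channels, frames) complex subband snapshots
    w_q : (channels,) quiescent weight vector (e.g., delay-and-sum toward
          the look direction)
    B   : (channels, channels-1) blocking matrix whose columns are
          orthogonal to the look-direction steering vector
    w_a : (channels-1,) active weight vector, adapted under the chosen
          criterion (MMSE, maximum negentropy, maximum kurtosis, ...)
    """
    upper = w_q.conj().T @ X                    # fixed beamformer branch
    lower = w_a.conj().T @ (B.conj().T @ X)     # adaptive noise branch
    return upper - lower                        # one output sample per frame

# Toy usage with random placeholder data (8 channels, 100 frames).
rng = np.random.default_rng(0)
M = 8
X = rng.standard_normal((M, 100)) + 1j * rng.standard_normal((M, 100))
d = np.ones(M)                        # steering vector for a broadside look direction
w_q = d / M                           # delay-and-sum quiescent weights (w_q^H d = 1)
B = np.linalg.svd(d[None, :])[2][1:].conj().T   # columns span the null space of d
w_a = np.zeros(M - 1, dtype=complex)  # unadapted active weights
Y = gsc_output(X, w_q, B, w_a)
```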

Entropy is a basic [inaudible] from information theory. It's a measure of pure information. So here is the definition. If I take the log base 2, then I'm measuring entropy in bits. If I take it in log base E -- i.e., the natural logarithm -- I'm measuring it [inaudible]. Okay. But one way or another this definition of entropy is a measure of pure information. Okay. The more information I have -- well, the more I learn every time I see this random variable -- the less predictable this random variable is. The Gaussian random variable is the least predictable random variable. Other random variables are more predictable.
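Spelled out, the definition being referred to is the differential entropy; the base of the logarithm only changes the unit:

```latex
% Differential entropy of a random variable Y with density p_Y;
% log base 2 gives bits, the natural logarithm gives nats.
H(Y) \;=\; -\int p_Y(y)\,\log p_Y(y)\,dy
      \;=\; -E\!\left[\log p_Y(Y)\right].
```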

So there's a well-known measure of non-Gaussianity from the field of ICA. It's called negentropy. And all it amounts to is the difference between the entropy of a Gaussian random variable and that of a non-Gaussian random variable. Okay. Both have the same variance, but one is Gaussian, the other is not.

You can also see, if you plug this into that equation, that it's quite obviously the expected value of a log-likelihood ratio between a non-Gaussian random variable and a Gaussian random variable.
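Written out, with a Gaussian variable of the same variance as the signal, the negentropy and its log-likelihood-ratio reading are:

```latex
% Negentropy: entropy gap to a Gaussian Y_G with the same variance as Y.
J(Y) \;=\; H(Y_{\mathrm{G}}) - H(Y) \;\ge\; 0,
\qquad J(Y) = 0 \;\iff\; Y \text{ is Gaussian}.
% Equivalently, the expected log-likelihood ratio between the true
% (non-Gaussian) density and the matched Gaussian density:
J(Y) \;=\; E\!\left[\log \frac{p_Y(Y)}{p_{\mathrm{G}}(Y)}\right].
```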

There's also another well-known measure of Gaussianity; namely, kurtosis. Okay. So super-Gaussian random variables are actually called super-Gaussian because they have positive kurtosis. A Gaussian random variable has zero kurtosis. Sub-Gaussian random variables have negative kurtosis, and super-Gaussian variables have positive kurtosis. Okay.

Here's the definition of kurtosis.
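In the usual excess form, with mu and sigma squared the mean and variance, this is:

```latex
% Excess kurtosis: zero for a Gaussian, positive for super-Gaussian
% densities, negative for sub-Gaussian ones.
\operatorname{kurt}(Y) \;=\; \frac{E\!\left[(Y-\mu)^{4}\right]}{\sigma^{4}} - 3 .
```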

And generally as the kurtosis of a PDF increases, you're going to see more probability mass concentrate around the mean, more move out into the tails, and less probability mass in the intermediate areas. Okay. So super-Gaussian random variables are characterized by infrequent but large deviations from the mean. All right?

So this sort of stands to reason, if you think about speech. Speech is largely periodic. Not all of the [inaudible] are periodic, but the vowels are. The vowels have the most energy. So if you think about that spectrogram that I showed before, you saw that during the portions of voiced speech you have this harmonic structure, okay, so if a subband falls directly on a harmonic, it's going to be very strongly excited; i.e., the value is going to be far, far away from zero, far away from the mean. If it falls between harmonics, it's going to be as good as nonexcited.

Okay. So most of the field -- with a few exceptions, one of which I already mentioned; i.e., the work of Parra and Alvino -- most of the work in acoustic beamforming was taken over more or less directly from earlier work in beamforming. And the earlier work in beamforming was done not for microphone arrays, but for antennas, underwater sonar, et cetera. Okay.

Most of the assumptions made in that earlier work do not apply to acoustic beamforming, because most of those assumptions were formulated for signals that propagate in a more or less free field. So Walter Kellermann might argue with me here, but my view is that those assumptions that were made for conventional beamforming really don't apply to acoustic beamforming. Why? Because in acoustic beamforming the two biggest enemies are noise and reverberation. So when I clap my hands, you hear the room ring down.

Long after my hands have come together, you hear that sound perpetuate.

Yes. That's reverberation. That's the big enemy.

It turns out that humans are very good at dealing with such a reverberant environment. In fact, the human auditory system is such that the intelligibility of the speech increases when it is in a reverberant environment, okay? You probably realize this if you've ever been in an anechoic chamber; you realize that it's harder to understand someone talking to you than if you're in an environment like this.

The reason for that is because the human auditory system is such that it can use the early reflections -- i.e., the reflections from the walls and from the windows and so forth -- that occur within 50 milliseconds of the first onset to increase intelligibility.

At the moment we don't fully understand how to make our automatic speech recognition systems do this. Okay. This is what we have to learn. This is why distant speech recognition is still an unsolved problem. There are some people that would say, well, speech recognition, that's a solved problem, I can go to Egghead Software and I can download Microsoft Vista; it has speech recognition, right? Not like this.

Okay. So what we've actually looked at -- in Trenso [phonetic] we presented a paper showing that you can actually use negentropy as an optimization criterion for beamforming. And, in fact, when you use negentropy, it works better than any of the other standard techniques. It works better than anything based on minimum variance distortionless response or minimum mutual information -- or, sorry, it works better than MMSE; i.e., better than minimum mean square error. Yeah. It works better than techniques based on those two approaches.

So in maximizing the negentropy of a beamformer, what we're actually trying to do is to restore the non-Gaussian characteristics of the original clean speech; i.e., we know that speech is non-Gaussian, its PDF is non-Gaussian. So in maximizing a negentropy criterion, we're trying to eliminate the distortions caused by reverberation and by noise, which tend to make the speech more Gaussian. Okay.

It so happens that in order to use the negentropy criterion, as I've shown before, you actually need to know what the PDF of the speech looks like. You see it comes into play here. So this is not really a problem because, well, anyone doing speech recognition research has hundreds or thousands of hours of speech data online, so it's not so difficult to estimate this PDF, if you make the assumption that the PDF is characterized by a parametric model.

If you want to really estimate the PDF of speech, that's more difficult, because of course then you have to use your characteristic function, estimate all the higher-order moments, et cetera, et cetera, and then it becomes much more difficult.

It turns out, however, that you can also use kurtosis as a maximization criterion. Okay. We actually have a paper about this at Interspeech next week. And the nice thing about kurtosis is that you don't need to know the PDF; you just need to know the second-order moments and the fourth-order moments of the speech. Okay.
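As a minimal sketch of that idea (not the actual implementation from the paper), the quantity being maximized over the active weights only needs the empirical second- and fourth-order moments of the beamformer output:

```python
import numpy as np

def empirical_excess_kurtosis(y):
    """Excess kurtosis of (possibly complex) beamformer output samples.

    Only second- and fourth-order moments are needed; no assumption about
    the underlying pdf of the speech. For real-valued data the Gaussian
    baseline is 3; for circular complex subband samples it is 2.
    """
    y = np.asarray(y)
    power = np.mean(np.abs(y) ** 2)
    fourth = np.mean(np.abs(y) ** 4)
    baseline = 2.0 if np.iscomplexobj(y) else 3.0
    return fourth / power**2 - baseline

# A maximum kurtosis beamformer adapts the active weight vector w_a so that
# empirical_excess_kurtosis(gsc_output(X, w_q, B, w_a)) -- using, e.g., the
# gsc_output sketch above -- is as large as possible, typically with a
# regularization term on the norm of w_a.
```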

By the way, I didn't do this work. This work was done by Ken'ichi Kumatani as well as my colleagues here in the audience, Friedrich and Barb.

Okay. So we actually looked at a simple model of what happens to the speech when we use kurtosis as a maximization criterion or an optimization criterion for beamforming. So we took a very simple example where we have one source, a wall -- i.e., a sound-reflective surface -- and then a microphone array. So as you all well know, when you have a simple MVDR beamformer, of course it's going to try to null out any source coming from anything but the look direction. Okay. This leads to the well-known signal cancellation problem.

It's kind of curious, though, if you take negentropy as an optimization criterion, it's not going to try to null out any source coming from anything but the look direction. Rather, if you have a source coming from something other than the look direction and it's strongly correlated with your desired signal, it will try to amplify it. Okay. It will try to amplify it because of course in the subband domain a delay manifests itself as a phase shift. This phase shift can be compensated for by simply setting the active weight vector correctly.

So because you're trying to maximize a negentropy criterion, you're trying to make the signal as non-Gaussian as possible. This is what the negentropy criterion tells you, how non-Gaussian your signal is. You get a non-Gaussian signal by adding up the same portion of the original speech, because the original speech is very non-Gaussian. You get a Gaussian signal, on the other hand, by adding up what are essentially independent portions of the same signal. This is why speech becomes more Gaussian when it's corrupted by reverberation or by noise. Okay?

So, anyway, what you see here in this plot is evidence that with a negentropy criterion, you don't suffer from signal cancellation, which is the primary problem with the conventional beamforming criterion. Rather, you have here signal amplification. If the signal comes from some direction other than the look direction, it's still the same signal. So here you will try to amplify it.

So here are a few ASR results, which is what everyone cares about, not about simulations or any theoretical arguments. This is what you care about, right? So you can see quite clearly that the maximum negentropy beamformer does better than the delay-and-sum beamformer, better than the delay-and-sum beamformer with a post-filter, and better than the MMSE beamformer, the minimum mean square error beamformer, okay?

You'll notice, though, that the maximum kurtosis beamformer does pretty well. It's not quite as good as the maximum negentropy beamformer, but it's still pretty good.

>>: [inaudible] batch or online?

>> John McDonough: We did everything in batch, so these are actually four passes of the speech recognizer. So the first pass is completely unadapted; the second pass uses VTLN and feature-based adaptation; the third pass uses VTLN, feature-based adaptation, and MLLR; the fourth pass uses VTLN, feature-based adaptation, MLLR, together with a speaker-adapted training model. So a still stronger acoustic model.

>>: The first, the beamforming part is the batch --

>> John McDonough: Yeah. So we haven't tried yet to develop an online algorithm, so we're just processing the entire -- it's utterance by utterance, but it's still batch. Yes.

What's interesting is that with the conventional beamforming techniques -- i.e., MMSE -- you actually have to beamform only during the silent portion of the speech -- i.e., when a speaker is not speaking -- otherwise it doesn't work. For the maximum negentropy beamformer, we beamform on the entire utterance. We don't care if it's silence or if it's during speech.

Okay. So here I wanted to talk a little bit about what Mike Seltzer had done, and this sort of borders on what Barbara presented yesterday or the day before yesterday at [inaudible]. If you look at this diagram, these time-frequency plots of speech, you see of course that speech is very nonstationary. We all know this. So you have a lot of energy here during the vowels. In between the vowels, during the consonants and the fricatives, et cetera, you have very little energy.

So what Barbara has looked at recently is trying to use the hidden Markov model to model this nonstationarity. Okay. So what she has done is said, well, we're getting a nice improvement by using this negentropy criterion to model speech, okay, and trying to restore the non-Gaussian statistics of the original speech. But what we've ignored so far, or prior to this work in any event, is the fact that speech is nonstationary. It has very different energy levels between vowels and fricatives or consonants.

So what she has done is actually -- we've taken the first pass of the beamforming, done speech recognition, extracted a [inaudible] alignment from the speech recognizer, and then taken an auxiliary model that has the same context-dependent information as the model we use for speech recognition, and then aligned that with the utterance. And then based on this auxiliary model, we've been able to back out the power spectral density for any given short-time segment of speech.

So then we've actually used this short-time change in the speech energy in order to do better beamforming. Okay. And as you recall from the presentation a couple days ago, this actually yields a further improvement in -- rather, a further reduction in the word error rate.

So what Barbara has done so far is to take into account the nonstationarity of the second order statistics of the speech. What we want to do next is to take into account the fact that not only the second-order statistics are nonstationary but also the higher-order statistics. So we want to make both the shape factor and the scale factor dependent on the short-time characteristics of speech.

Okay. So what I said at the outset -- unfortunately I didn't have time to prepare slides to illustrate this remark. But, in our experience, if you want to do better distant speech recognition, it's very, very important to do all the little things right. We've seen that innumerable times in our recent work. So if you were there on Monday or Tuesday, you probably remember that I was asking many questions about which filter bank are you using.

It turns out that the filter bank that you use in order to analyze your speech is extremely important, okay. At the start of all of this, we were using the wrong filter bank. We were using a filter bank that was optimal for subband coding of speech. It turns out it was not optimal for adaptive filtering or for beamforming.

So it took us -- we were sort of starting out, we had no expertise in filter bank design, so it took us probably the better part of a year to realize that. The filter bank that we were using was actually masking gains that we would have otherwise had by improving our beamforming or by improving our post-filtering. Okay.

It's also very important to optimize your recognition engine on the data that you actually want to recognize; i.e., building a recognition engine or an acoustic model that's really good for close-talking data -- data from a close-talking microphone -- is not going to give you the best performance for your far-field task.

It's also really important to work on real data. We learned this the hard way, because when we started out doing this, we were working on data that was artificially convolved. So we're thinking, wow, we're doing really well on this data, you know, we can do almost nothing at all, you know, we can do delay-and-sum beamforming and our standard speaker adaptation tricks and we're within 2 percent of what we get with a close-talking microphone. Then you try it on the real data and you find out you've made no progress at all.

So, yeah, in short, you have to do all of the little things right in order to do good far-field speech recognition. That includes the speaker tracking as well.

All right. So I'm going to finish up a little bit early. Do we have a Web browser? I wanted to play an audio sample.

>>: [inaudible]

>> John McDonough: All right. So how do I get back to the desktop? Okay. Got it.

Oh, nope. Yeah.

>>: [inaudible]

>> John McDonough: Yeah. Okay. So --

>>: Do you want to play this one?

>> John McDonough: Yeah. All right. So here's a little -- this is the only thing that I prepared that approaches a demo. So here is actually a segment of speech from the Speech Separation Challenge. So this was organized by Mike Lincoln, who is at the University of Edinburgh, and he actually collected data from the WSJ 5k task, the Wall Street Journal 5,000-word speech recognition task. But unlike the standard Wall Street Journal corpus, this data was read by speakers who were standing several meters away from the microphone array. So it was collected with a circular microphone array -- actually two circular microphone arrays of 20 centimeters in diameter, eight channels. So the speakers were speaking simultaneously.

[Audio playing]

>> John McDonough: So everyone understood what the second woman said, of course.

Mike Seltzer, what did the second woman say?

[Audio playing]

>> John McDonough: Something about the stock market losing 500 points in one day or something like that?

>>: [inaudible]

>> John McDonough: Okay. Let's try this.

[audio playing]

>> John McDonough: So this is what comes out of our beamforming, after MMI beamforming, minimum mutual information beamforming. We also use a technique called binary masking, which was also discussed during [inaudible].

[audio playing]

>> John McDonough: It turns out that because speech is so sparse in the subband domain, you can actually use a binary mask to separate the speech; i.e., you can compare the outputs of the two beamformers and see which one is louder in any given subband, and then set the other one to zero, which is in fact what we do. And you can see...

[audio playing]

>> John McDonough: That's without the binary masking. The binary masking really improves the separation.

[audio playing]

>> John McDonough: Unfortunately it introduces artifacts.

[audio playing]

>> John McDonough: So you can hear the sort of gurgling, underwater effect. In any event, we're still working on this.
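A minimal sketch of the per-subband binary masking just described, assuming the two beamformer outputs are available as subband-by-frame arrays; this shows the idea only, not the exact post-processing used in the challenge system:

```python
import numpy as np

def binary_mask(Y1, Y2):
    """Hard per-subband masking of two beamformer outputs.

    Y1, Y2 : (subbands, frames) complex subband outputs, one per speaker.
    In each time-frequency bin, keep the louder output and zero the other,
    relying on the sparsity of speech in the subband domain.
    """
    keep_first = np.abs(Y1) >= np.abs(Y2)
    return np.where(keep_first, Y1, 0.0), np.where(keep_first, 0.0, Y2)
```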

>>: Are the first, second, and fourth essentially equivalent except for filtering?

>> John McDonough: No. Essentially these are more or less equivalent. These two here are equivalent except for the binary masking. So here's the original single channel of the microphone array.

[audio playing]

>> John McDonough: So there you hear essentially no separation. And this is with the MMI beamforming as well as the binary mask.

[audio playing]

>> John McDonough: Here's the same thing without the binary mask.

[audio playing]

>> John McDonough: See? In this one obviously the second voice is still audible.

>>: Right. I'm trying to figure out is there something you can play that would show differences in filter [inaudible]?

>> John McDonough: No. I don't have such audio samples. But there is a significant difference. So I mentioned before that when we started out doing this, we were using the wrong filter bank; i.e., we were using the perfect reconstruction filter bank. So you can see the difference in word error rate.

It turns out that when we were using the perfect reconstruction filter bank, we got no improvement. In fact, we got a degradation by applying a post-filter. But with a filter bank that was designed specifically for beamforming or adaptive filtering, you see that we get a healthy improvement by doing post-filtering. Okay.

So it's -- well, I probably don't have time to go into this, but the perfect reconstruction filter bank is called perfect reconstruction because, well, it gives you perfect reconstruction provided you don't change the subband amplitudes; i.e., you don't do any beamforming or adaptive filtering in between the analysis and the synthesis.

It gives you that because it's based on a concept of alias cancellation. So the perfect reconstruction filter bank is a maximally decimated filter bank; that means if you have M subbands, you're decimating by a factor of M. Okay.

It only works because the aliasing that is present in one subband is canceled off by the aliasing that's present in all the other subbands. But as soon as you start mucking around with the subband magnitudes, this no longer works. So Walter Kellermann knows, of course, about this.

So it's kind of funny because, yeah, filter bank design is an old science. But people realized relatively late that filter banks were good not only for subband coding but also for adaptive filtering and beamforming. That's why, although filter bank design is old, the good designs for adaptive filtering and beamforming have only appeared recently; i.e., since the turn of the century. Okay.

So in any event, as I said, it's important to do all of the little things right. It's important to look inside of the black boxes. You can't just say, okay, I have a filter bank, it's as good as any other filter bank, or I have overlap-add. Yeah, overlap-add does even worse than any of these filter banks, oddly enough. Okay.

>>: So the feature extraction for this, if you go from the -- if you went back through the synthesis filter and then the feature extraction, you could also just extract features [inaudible] output, right?

>> John McDonough: Yeah, you could.

>>: But then would this gain issue be as much of an issue?

>> John McDonough: In theory you could do that. I mean, for speech recognition we do everything based on the frequency [inaudible] representation. We put a Hamming window on the speech and then we take an FFT. Yeah. Unfortunately, it's not quite as simple as that. So I don't know how to do what you're suggesting, because for beamforming you need really good frequency resolution. Okay. For speech you don't need such good frequency resolution; rather, you need good time resolution. That's why, in fact, when we do beamforming -- or, rather, when we do the analysis for speech recognition -- we smear out essentially the frequency resolution. We put these Mel filters on top of it. We don't care about exactly what happened with the frequency resolution.

So for those reasons, we don't yet know how to go directly from sort of the beamforming frequency representation to a representation that we can use for ASR. We have to go back through the synthesis filter. We have to resynthesize the output.

>>: So what do you end up with on the output of the [inaudible]?

>> John McDonough: [inaudible]?

>>: How many frequency bands and [inaudible] and how many subbands?

>> John McDonough: Well, we end up with a time signal.

>>: But, I mean, if you stop, how many subbands are near adaptive?

>> John McDonough: Okay. So we typically use, what, 256 subbands. So we're looking to increase the number of subbands, because if you think about it, the number of subbands that you actually use is related to the length of the filter that you're applying to each of the outputs of the microphone array. So it's kind of strange because to do better dereverberation, we would like to increase the number of subbands, because we'd like to have a longer filter so we can dereverberate the signal better.

On the other hand, for speech recognition, we would like to have shorter subbands or a smaller number of subbands. Because in speech recognition, as I mentioned, you want better time resolution. So right now we're using 256.

We want to increase the number of subbands.
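As a rough back-of-the-envelope relation: with decimation factor D, each subband-domain filter tap advances by D samples of the original signal, so an L-tap filter spans roughly L times D over the sampling rate in seconds. The sampling rate and decimation factor in the example below are illustrative assumptions only, not the exact system settings:

```latex
% Temporal span of an L-tap subband filter with decimation factor D
% and sampling rate f_s (the numbers below are illustrative only).
T \;\approx\; \frac{L\,D}{f_s},
\qquad \text{e.g.}\quad D = 128,\ f_s = 16\,\mathrm{kHz}
\;\Rightarrow\; \frac{D}{f_s} = 8\ \mathrm{ms\ of\ reach\ per\ tap}.
```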

We've also thought about -- okay. So currently for our beamforming we're using only one single subband block. We've also thought about doing multiple subband blocks, sort of like Walter Kellermann and Armin Sehr are doing, you know, to try to account for the fact that you have this fairly long delay between the arrival of the direct path and the reverberation, the early reflections.

We've also thought about using a particle filter to actually account for the late reflections, because in the late reflections, essentially what you have is a diffuse noise field. And several people have shown -- well, researchers at NTT have shown that you can actually use multistep linear prediction in order to account for this late reverberation when the reflections are essentially so numerous that you can no longer distinguish individual reflections. You have a diffuse noise field. Okay. All right. Any other questions?
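For reference, the multistep (delayed) linear prediction idea can be sketched in a single-channel, time-domain form roughly as follows. The delay, filter order, and least-squares solver are illustrative choices, and the published NTT method applies the suppression in the spectral domain rather than by direct waveform subtraction; this only shows the delayed-prediction idea:

```python
import numpy as np

def multistep_lp_dereverb(y, delay=30, order=200):
    """Suppress late reverberation via delayed (multistep) linear prediction.

    y[n] is predicted from y[n-delay], ..., y[n-delay-order+1]; the delay
    skips the direct path and early reflections, so subtracting the
    prediction mainly removes the late, diffuse reverberation.
    Assumes len(y) is comfortably larger than delay + order.
    """
    y = np.asarray(y, dtype=float)
    start = delay + order - 1
    # Delayed data matrix: row i holds y[start+i-delay], ..., y[start+i-delay-order+1].
    idx = (start + np.arange(len(y) - start))[:, None] - delay - np.arange(order)[None, :]
    A = y[idx]
    b = y[start:]
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    dereverbed = y.copy()
    dereverbed[start:] = b - A @ coeffs   # subtract the predicted late reverberation
    return dereverbed
```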

>> Mike Seltzer: Thank you, John.

[applause]
