>> Ivan Tashev: Good morning. It's my pleasure to introduce Dr. Nikolay Gaubitch from Imperial College London. He will present his work in the area of noise-robust blind system identification and subband equalization of room transfer functions, with applications to speech dereverberation. So without too long an introduction, Nikolay has the floor.

>> Nikolay: Thank you, Ivan. Good morning, everyone. Thanks, everybody, for coming along, and thank you especially to Ivan for inviting me over here. It is a great honor to give a talk at Microsoft Research. It's quite a long title; it's almost like the conclusion of the talk is in the title. So, for completeness, I guess many of you are familiar with the dereverberation issue: reverberation is illustrated here with a talker at some distance from a microphone, where hopefully you get the direct-path sound plus some reflections, which are delayed and attenuated versions of this direct sound. And this causes a lot of trouble in hands-free telephony and sound reproduction. I say that it can have adverse effects because in some situations it is actually good to have reverberation; in music, for example, people like it. So my talk will be in two interlinked but separate blocks: one on adaptive system identification, and one on equalization. I'll start with a quick introduction, formulating the problem, then a few classes that I believe you can split existing dereverberation methods into, and then the two main bits on system identification and equalization. So this is, I guess, familiar to many of you who work with this. You have a speech signal s(n) produced in a reverberant room, and often you model the room acoustics as an FIR filter. The observed signal is the convolution of this FIR filter, the room impulse response, with the speech signal, plus some additive noise. And the aim of dereverberation is to find some kind of estimate of this speech signal s(n), possibly a delayed or scaled version of it, using the observations x only. This is known as a blind problem, because we only have x and none of the other signals available, generally. So I believe, at least from my thesis work, that you can divide the current existing dereverberation algorithms into three large classes. One is beamforming, where you have some kind of array of microphones and you steer a beam towards the desired speaker, the source, and exclude any other interfering sources; this is exclusive to multiple microphones. Then you have speech enhancement methods, where often you have a model of your speech signal, perhaps based on the speech production model of humans, and you try to process your reverberant observation so that it better fits this speech model that you have. Alternatively, you can also have a model of the room impulse response, which gives (inaudible) kind of spectral-subtraction-fashion algorithms for dereverberation. And finally you have the most notorious (inaudible): the blind system identification and equalization type of methods, where you try to explicitly estimate the room impulse responses and design some equalization filters based on these estimates, so that you can remove the (inaudible) reverberation. The bits I'm going to talk about fall into this last category. So, yes, this is the blind system identification issue: to try and estimate your impulse responses, h, using only the observed signals, x. The rest of it is sort of in the dark.
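A minimal numpy sketch of this observation model; the function name, the white-noise assumption, and the SNR parameterization are illustrative assumptions, not from the talk:

```python
# Sketch of the observation model: x(n) = (h * s)(n) + v(n).
import numpy as np

def observe(s, h, snr_db=40.0, seed=0):
    """Convolve speech s with room impulse response h and add white noise."""
    rng = np.random.default_rng(seed)
    x_clean = np.convolve(s, h)
    v = rng.standard_normal(len(x_clean))
    # Scale the noise so that 10*log10(P_clean / P_noise) = snr_db.
    v *= np.sqrt(np.mean(x_clean**2) / (10**(snr_db / 10) * np.mean(v**2)))
    return x_clean + v
```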
So many of the algorithms which exist for this, the multichannel system identification algorithms, are based on this cross-relation between two microphones: the observation of the first microphone convolved with the impulse response of the second microphone is the same as the observation of the second microphone convolved with the impulse response of the first microphone. Using this, you can form a system of equations which gives you the opportunity to find a solution for h -- actually a scaled solution of h -- by finding the eigenvector corresponding to the smallest eigenvalue of R, which is a correlation matrix. And this works provided that the channels are sufficiently different, so that you have no common zeros between the room transfer functions, and that your excitation signal contains large enough spectral variety. One step further: if you take this cross-relation, you can formulate an error signal which extends to any number of channels, by taking various combinations of these channels. And you can use some kind of adaptive filter to minimize this error. The good thing with an adaptive filter is that, in theory at least, it is able to track changes in the system, which occur quite often with room impulse responses; as you move around the room, the impulse response changes, for example. So you can have this cost function based on the errors of all the combinations of channels, and there are a few different implementations, mainly by Huang and Benesty, which came out both in the time domain and in the frequency domain -- in particular this normalized multichannel frequency-domain LMS (NMCFLMS), which is quite an efficient version of these algorithms. And before I move on, I'll just introduce briefly this normalized projection misalignment (NPM), which is often used for measuring the misalignment between your estimated channel and your true channel. The projection bit actually helps avoid any dependence on the scaling factor which occurs with these algorithms.

So now if you look at a quite ideal example with this frequency-domain adaptive algorithm: you have three randomly generated channels, fairly short, 32 taps; the input is white noise, and there is no additive noise.

>> Question: What (inaudible)?

>> Nikolay: It just means you have some random taps that you generate. There's no --

>> Question: (Inaudible).

>> Nikolay: The impulse response -- exactly. So, yes, these are quite ideal conditions for the algorithm, and it works well, both in terms of minimizing the error -- here you have the cost function on the Y axis and the number of iterations on the X -- and here is the projection misalignment versus the iterations. So that's fine. However, when you have some additive noise, for example white noise, the problem becomes that the cross-relation doesn't hold anymore, so your cross-relation error actually consists of two terms. One is the original desired cross-relation error, which is the good bit; but there is also a component here which involves the noise and the channel estimates. And trying to minimize these two components together kind of messes things up. You can see here an example of what happens: they have this strange behavior, where you start going down, converge towards the right solution, and then they misconverge into something else. And this seems to be unavoidable.
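To make the cross-relation error and the NPM measure concrete, here is a minimal numpy sketch; the two-channel setup and names are illustrative assumptions:

```python
# Sketch of the two-microphone cross-relation and the normalized projection
# misalignment (NPM).
import numpy as np

def cross_relation_error(x1, x2, h1_hat, h2_hat):
    """x1 * h2 = x2 * h1 holds for the true channels, so this residual is
    an error signal that an adaptive filter can try to minimize."""
    return np.convolve(x1, h2_hat) - np.convolve(x2, h1_hat)

def npm_db(h, h_hat):
    """NPM in dB: project h_hat onto the true h first, which removes the
    arbitrary scaling that blind identification leaves undetermined."""
    h = np.ravel(h)            # stacked true channels
    h_hat = np.ravel(h_hat)    # stacked estimated channels
    eps = h - ((h @ h_hat) / (h_hat @ h_hat)) * h_hat
    return 20 * np.log10(np.linalg.norm(eps) / np.linalg.norm(h))
```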
So if you just try a smaller step size, the only thing that changes is this point of misconvergence; it just moves a bit further on. And the other interesting point is that this is 40 dB SNR, which is a pretty high SNR, and you still get this effect. So very, very small amounts of noise disrupt this algorithm. And the question is, what can we do to avoid this effect? One thing we thought about is to try and include some kind of constraints in the adaptation: use more information about the impulse responses or about your environment to try and stop this misconvergence. One obvious thing to do is, if you have one known tap in your impulse response, to force the algorithm to always keep that tap the same. And one tap you could use is the direct-path component. So the direct path of the impulse response, which you could, say, estimate with another algorithm for time delay of arrival estimation. So this is one way to do it. Another observation that we had is that this misconverged solution has a strange low-pass effect; it always tends to this low-pass tilted spectrum. And what we thought of doing is to impose a constraint that your spectrum should have a somewhat uniform energy distribution across all frequencies. It's a debatable assumption whether this is the case, but I guess some parts of room acoustics support the idea that you have a uniform distribution. Either way, whatever constraint you take, you can put it in with a Lagrange multiplier, and it essentially adds a penalty term to the adaptation process. And here is just one example with, again, quite short room impulse responses, simulated as such. There are five channels, so each bit here represents one channel of the true ones; the room impulse responses are truncated to 128 taps. Here is the misconverged solution, which was --

>> Question: What was that again?

>> Nikolay: Eight kilohertz, here. So they are quite heavily truncated; it's mainly the initial part of the impulse response. So here you have the misconverged solution, using just the standard frequency-domain adaptive algorithm. And if you impose the spectral constraint, you actually manage to stabilize it, although you delay the convergence a little bit. The same effect you can also see with the direct-path constraint; they both work in the same way. And here you see the estimated channels, which correspond quite well to the original ones. Another thing to do to try and improve things with an adaptive filter is to get some kind of control of the step size, rather than having a fixed one. And this is another bit we looked at. So we tried to derive a step size we call optimal. And it's optimal in the sense that you want the step size to minimize the error between the true solution and your estimate at the next step, given your current estimate of the impulse response. And from this, you obtain a step size which at each iteration can be calculated using this expression. Now, this contains two terms. One is this bit here, which has the gradient and your channel estimate; this is fine, it is available to you, so you can calculate it. But then you have this gamma term, which depends on the true channel. So of course it defeats the point a little bit if you need the true channel to calculate it.
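To make the constrained adaptation concrete, here is a rough sketch of one gradient step with a spectral-flatness penalty and an optional fixed direct-path tap; the exact cost and penalty gradient are not given in the talk, so the forms below are assumptions:

```python
# Sketch: one constrained gradient step. grad_cr is the gradient of the
# cross-relation cost; lam weights an assumed spectral-flatness penalty
# (deviation of the magnitude spectrum from its mean); optionally a known
# direct-path tap is re-imposed after each step.
import numpy as np

def constrained_step(h_hat, grad_cr, mu, lam, d_idx=None, d_val=None):
    H = np.fft.rfft(h_hat)
    mag = np.abs(H)
    # Gradient-like term pushing the magnitude spectrum towards uniformity.
    flat = np.fft.irfft((mag - mag.mean()) * np.exp(1j * np.angle(H)),
                        n=len(h_hat))
    h_new = h_hat - mu * (grad_cr + lam * flat)
    if d_idx is not None:       # direct-path constraint: keep the known
        h_new[d_idx] = d_val    # tap fixed at every iteration
    return h_new
```

And the optimal step size just described, minimizing the distance to the true solution after one step, works out to the form below; gamma is the term that depends on the true channel and is unavailable in practice (names assumed):

```python
# Sketch: mu_opt = (grad @ h_hat - gamma) / ||grad||^2, obtained by
# minimizing || h_true - (h_hat - mu * grad) ||^2 over mu, where
# gamma = grad @ h_true. In practice an approximation is substituted.
def optimal_step_size(grad, h_hat, gamma_hat=0.0):
    return (grad @ h_hat - gamma_hat) / (grad @ grad)
```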
If you look at the noise-free case to start with, this term is actually zero, because the true solution and the gradient will be orthogonal to each other at all points. So your optimal step size is fine; you can get it by just calculating this first bit. However, in noise you don't have that. So, just going back to the noise-free case: as you see, as the estimate comes close to the true solution, it drives the step size to zero, which just halts the adaptation -- and that is what should happen. In the noisy case, this is what gamma does: it helps to support this driving of the optimal step size to zero. And we thought of trying to find some kind of approximation of this gamma term based on the bits that we have available. And, yes, you can have some approximation of it; this is the one we used, and we plug that into the algorithm. So here I can show you just a couple of examples of what happens. First, one of the issues which the optimal step size addresses is that otherwise you are selecting a step size manually, which can be quite troublesome. Here you see that if you have a step size of 0.02, the algorithm goes crazy; if you have 0.01, it is fine. But it is kind of difficult to tweak it, and so on. Whereas if you have the optimal step size, it selects it for you; it just converges, no problems. And here also, you see the two lines overlapping: so whether you have this approximation of the gamma term or not, there is no benefit when there is no noise, which is okay. On this figure -- here you have the misalignment on the Y axis and iterations on the X axis -- you see that if you introduce this approximation of the gamma term, you can actually gain some extra performance in the estimation, compared to assuming that it is zero all the time. And another interesting bit is that if you do use the true solution in the step size, you get some really, really good convergence, as you would expect. It is not a useful thing in practice, but it gives you some kind of performance limit of what you can get with this step size. So just to sum up this part of the talk: generally, when you have some measurement noise, you disrupt this cross-relation error, and your performance degrades with these algorithms. However, if you try and impose some constraints based on additional information that you have, you can increase robustness to noise quite a bit.

>> Question: Just a question. So when you say measurement noise, what exactly are you --

>> Nikolay: I mean --

>> Question: -- (inaudible).

>> Nikolay: Measurement noise is generally any noise. So it could come partly from the equipment that you have, or from other noise sources in the room.

>> Question: But those noises are a bit different.

>> Nikolay: Yes. Yes.

>> Question: So in the (inaudible) --

>> Nikolay: Yes.

>> Question: -- and the external noise, it's correlated, because it is still valid noise sources (inaudible) that channel, the microphones and system over.

>> Nikolay: Yes, yes. So, I mean, everything I looked at here is not correlated -- uncorrelated noise, yes, generated. So, if you have any other questions about this part before I proceed to the second bit... Okay. So I'll look at the other side of the issue: the equalization. So let's assume that you have some kind of estimate of your room impulse response, H.
And you want to design some equalizing filters so that when you convolve your impulse response with these filters, you get an impulse, which can be a delayed or scaled version of the true impulse. That's sort of the ideal equalization, I guess. As many of you know, this comes with quite a few practical problems. One is that room impulse responses are non-minimum phase, so you cannot achieve this with a single-channel, stable, causal inverse filter. The second problem is that H is normally several thousand taps long, which causes problems when you try to calculate the filters. A third problem is that you normally have errors in your estimates of H, as you could see from the previous part, so you can distort the equalized signal if you design your equalizers based on inaccurate estimates. And finally, you have this quite large dynamic range of the room transfer function, so equalizing it exactly could boost quite a lot of narrowband noise. So the one issue I'm going to look at here is multichannel least-squares equalization. Instead of having one filter per channel for equalizing, you can have a combination of filters which all together equalize the signals and give you one output. This is generally known as the MINT algorithm, which is just a least-squares solution to the problem formed from this relation. The good bit with this is that, first of all, it eliminates the non-minimum-phase problem, so you can actually get perfect equalization -- provided, of course, you have no common zeros between the room transfer functions; the transfer functions have to be different. The problem with this, on the other hand, is that it is quite sensitive to wrong estimates of the room impulse response. Some attempts to improve that were made, with some good results, I guess, by introducing a regularization into the minimization here. Okay, I can just show you a couple of examples of what happens if you compare this exact multichannel equalization with an approximate single-channel least-squares equalization. So you have some kind of magnitude distortion and phase distortion of the equalized output -- I'll go through these measures in a bit more detail later on -- and you have the system mismatch along the X axis. So here, plot A on both sides is the exact equalization with two channels, whilst plot B is the approximate equalization with one filter. As you see, if you have perfect estimates of your channels, you get perfect equalization; if not, you start to get quite bad performance from these exact equalizers, whilst this more approximate single-channel one degrades much more gracefully. Unfortunately, to get this, you need very large filters; in this case I've used a filter 15 times the length of the impulse response for the single channel. So that's not ideal. And another interesting observation is this: if you vary the length of the impulse response and equalize again with the multichannel and the single-channel equalizers, you see that the error actually grows with the length of the response that you are trying to equalize, although you are keeping the misalignment the same. So you get more and more trouble. So the idea from this is that it is actually best to try and do equalization with short filters rather than these long impulse responses.

>> Question: (Inaudible) Y axis (inaudible)?

>> Nikolay: Sorry, it is not really in dB; it is kind of dB. I'll show you the expression a bit later on. But this just demonstrates that you get some problems with this. It's essentially just the deviation of the magnitude from its mean; that's what it's supposed to be.
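A minimal two-channel sketch of this multichannel least-squares (MINT-style) design, under assumptions: the channel estimates are given, equal in length, and the regularization mentioned above is omitted:

```python
# Sketch: stack convolution (Sylvester) matrices of the two channel
# estimates and solve least-squares for g1, g2 so that
# h1*g1 + h2*g2 approximates a delayed unit impulse.
import numpy as np
from scipy.linalg import toeplitz

def mint_two_channel(h1, h2, Lg, delay=0):
    L = len(h1)
    n_out = L + Lg - 1                        # length of h * g
    def conv_matrix(h):                       # (n_out x Lg) Toeplitz matrix
        col = np.concatenate([h, np.zeros(Lg - 1)])
        row = np.zeros(Lg)
        row[0] = h[0]
        return toeplitz(col, row)
    H = np.hstack([conv_matrix(h1), conv_matrix(h2)])
    d = np.zeros(n_out)
    d[delay] = 1.0                            # target: delayed impulse
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return g[:Lg], g[Lg:]                     # equalizer for each channel
```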
Yes, so to try and shorten things, one obvious thing to do is to go to subbands, which people do. So this is a figure of the fullband multichannel case: you have the signal, some channels, the equalizers, and you sum those up and you get your equalized output. Now, if you want to do this in subbands, one way to look at it is this, which is a somewhat conceptual version: you plug this into the subband structure, so you have the signal, and in each subband you have the channel, the equalizer, and the output. I know this is quite a messy figure, but I think it does explain what I'm trying to get to eventually. So, yes, it just changes the order a little bit. This figure raises a few questions. One is how you choose your filter bank. The second is, if you have only the fullband estimates of your room impulse responses, for example, how do you relate those to these subband filters so that we can design these subband equalizers? So, starting with the filter bank: the one we chose to use was an oversampled filter bank and, in particular, the generalized DFT (GDFT) structure. The reason I chose that is partly that it's fairly straightforward to implement fractional oversampling, and also there are some efficient implementations of it. And the reason we want the oversampling is that we can perform filtering in the subbands approximately by using only one filter per subband, rather than the cross-filters which you normally have to have otherwise. So there are two properties which are helpful for this. One is that, in these filter banks, you can suppress the aliasing in the subbands significantly. And also there is very little magnitude distortion at the output of the filter bank. And having that, you can relate your fullband filter to the subband structure, where essentially what I want is to find the relation between the fullband transfer function and the subband filters. So I want the transfer function of this to be equal to the total transfer function of the filter bank, so that if you input the signal here, you get an output here which is the same as with these two. And one way to do it is this; there are some people who have used this approach, and you can arrive at a least-squares solution to approximate these filters. They are not exact, but you get pretty good results in terms of the error between this signal and this signal by using this decomposition. And the other major advantage is that your subband filters of the room transfer functions are now the length of the fullband filter divided by your decimation ratio, which is what we wanted to get to. And if you now have these estimates, you can design your equalizing filters for each subband based on these subband-equivalent filters. And you achieve overall equalization by applying the filters to each subband and then reconstructing the equalized fullband signal. And of course, since the impulse responses of the room transfer functions are reduced by a factor of N, the decimation ratio, so are the equalizers. So let me show you just a couple of results with this approach to equalization. For this example we used a filter bank with 32 subbands, decimated by a factor of 24, with a 512-tap prototype filter for the design of the filter bank.
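One simple way to picture the length reduction is to filter the fullband impulse response through each modulated analysis filter and decimate. The talk uses a least-squares decomposition instead, which is more accurate; this sketch, with an assumed GDFT-like complex-modulated prototype, just shows the mechanics:

```python
# Sketch: subband-equivalent channels by analysis filtering and decimation.
# proto is an assumed lowpass prototype; K subbands, decimation factor N
# (oversampled when N < K). Each subband channel has length ~ len(h) / N.
import numpy as np

def subband_channels(h, proto, K, N):
    n = np.arange(len(proto))
    out = []
    for k in range(K):
        f_k = proto * np.exp(2j * np.pi * k * n / K)  # k-th modulated filter
        out.append(np.convolve(h, f_k)[::N])          # filter then decimate
    return out
```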
We simulate the system mismatch by just adding some noise to the impulse response that we're trying to equalize, so that you get some desired misalignment. And these are the two measures that I had previously: the magnitude deviation and the linear phase deviation. So these are the two things we measure separately. The magnitude deviation, as I said, is the deviation of the equalized signal's magnitude from its mean, while the phase deviation is the deviation of the equalized signal's phase from a linear phase. So the first case is just randomly generated channels, with the taps randomly generated. The channels are 512 taps long, and there are five different channels. You can see here the magnitude deviation on the Y axis, the phase deviation on the Y axis down here, and the system mismatch across the X axis. So for the fullband multichannel equalization you get what we got before: pretty bad distortion once you start getting inaccuracies in your estimates. With the subband case, you get a much more graceful degradation again, now that the filters are much shorter. And here is the case with longer impulse responses: these are simulated room impulse responses, 4,800 taps. Again, there are five microphones and 10 different locations of the source, and the results are averaged over the microphones. So you get pretty similar performance as for the random channels before. And another nice thing with -- sorry, yes, I'll just show you a couple of examples with this, one case when you are here and one case when you are here, so that you can see what the output looks like. So here is the impulse response of the output with nearly noise-free estimates of your impulse responses, and here is the magnitude. You can see that it is pretty well equalized. On the other hand, when you have the minus 10 dB misalignment, you start to get some rubbish in the equalized output. However, if you listen to it, it doesn't distort the signal, which is quite interesting; it changes the spectrum, as you can see from this, but it doesn't distort it. I do have some samples. I'm not playing any here, because generally with reverberation, playing through the room doesn't really get the effect across; but I have some samples on the computer if anybody wants to listen with headphones later on, so you can hear what these things sound like. And another good benefit of this subband equalizer is that you can actually achieve quite a lot of savings in computation. So this is floating-point operations versus your impulse response length, and the saving is fairly consistent: you get a factor of 120 reduction in computational complexity, which allows you to invert quite long room impulse responses fairly easily. And again, to summarize this bit of the talk, I think the key points with this subband equalizer are that you get a more efficient version of an equalizer, and you can improve your robustness to noisy estimates of your impulse responses -- you get quite near-perfect equalization when the mismatch is smaller than minus 40 dB. And also, I like this decomposition for relating the fullband and the subband impulse responses as generally useful for any subband development of equalizers or system identification.
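A plausible reading of the two measures used above, following only the verbal description (deviation of the log-magnitude from its mean, and deviation of the unwrapped phase from its best linear fit); the exact expressions in the talk may differ:

```python
# Sketch of the two equalization measures, from the verbal description only.
import numpy as np

def magnitude_deviation_db(c, nfft=8192):
    """Std of the log-magnitude spectrum of the equalized response c = h*g."""
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(c, nfft)) + 1e-12)
    return np.std(mag_db)

def phase_deviation(c, nfft=8192):
    """RMS deviation of the unwrapped phase from a linear least-squares fit;
    a pure (delayed, scaled) impulse has exactly linear phase."""
    phi = np.unwrap(np.angle(np.fft.rfft(c, nfft)))
    w = np.arange(len(phi))
    a, b = np.polyfit(w, phi, 1)
    return np.sqrt(np.mean((phi - (a * w + b)) ** 2))
```

And, for the speech results that follow, one common form of a segmental signal-to-reverberation ratio; the frame length and the alignment convention here are assumptions:

```python
# Sketch: frame-wise energy ratio between the direct-path reference and the
# residual, averaged over frames; assumes the signals are time-aligned.
def segmental_srr_db(ref, proc, frame=256):
    n = min(len(ref), len(proc)) // frame
    vals = []
    for i in range(n):
        d = ref[i * frame:(i + 1) * frame]
        e = d - proc[i * frame:(i + 1) * frame]
        if np.any(d) and np.any(e):
            vals.append(10 * np.log10(np.sum(d**2) / np.sum(e**2)))
    return float(np.mean(vals))
```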
So, just finally, I'll show you a couple of figures on applying this equalizer to speech signals, and some results on the dereverberation in terms of bark spectral distortion and segmental signal-to-reverberation ratio. Now, I know that the choice of measures for evaluating dereverberated signals is a debatable issue, but this is what I used here. And here you have several plots. So this is segmental signal-to-reverberation ratio versus various T60s; this also uses simulated room transfer functions (inaudible) microphones. Here, at the bottom, is the unprocessed reverberant signal. The first plot above it is just a delay-and-sum beamformer, what that would do. And then each of these plots is for zero dB, minus 30, minus 60, and noise-free, exact estimates of your impulse responses. So you see that if you have a perfect estimate, you can get more or less perfect dereverberation. And with noise you are still doing quite well with bad estimates of your impulse responses, at least in terms of signal similarity. So the debatable issue with these measurements is whether they represent the audible differences. I believe that if we have some kind of signal similarity which reaches down to here, it probably sounds pretty good. And you get a similar picture in terms of bark spectral distortion. Again, here is the unprocessed signal, this is the beamformer, and down here are all the lines for the equalized signals.

>> Question: So in this case we can say the (inaudible) distortion is mostly (inaudible), while the segmental (inaudible) reverberation ratio (inaudible) variable. I think (inaudible) what we are doing (inaudible) down there.

>> Nikolay: Yes.

>> Question: (Inaudible).

>> Nikolay: Yes. Well, yes. And if you can relate, I guess, the bark spectrum more to the audible difference, it probably shows that the audible difference between these three is not huge. So the point is that going from, you know, 15 dB to 30 probably doesn't make a huge difference when you listen to it, which I think can be true to some extent. So, just to sum up the overall talk: I think that this adaptive system identification and equalization can provide really good dereverberation in theory, which is good to know -- it is possible to do. However, in practice it is difficult, because you have a lot of noise and other issues, so it takes quite a bit of tweaking to get things working. However, I believe, and I think that is what I tried to get across, that if you try and collect as much information about the environment as you can and put it in as constraints on the algorithm, you can gain quite a lot of robustness and, in a sense, reduce the blindness of the problem. And the other bit is to try and reduce the dimensionality of the problem by divide-and-conquer approaches. And this hopefully will lead to some more practical algorithms using these types of approaches. And just before finishing off, I would like to thank a few of the people I've had the pleasure to meet and work with in relation to all this, and in particular Patrick Naylor, who was my PhD supervisor and continues to support me in the things that I do. So thank you for listening. Thanks for your attention.

[APPLAUSE]

>> (Inaudible). So Nikolay (inaudible) this afternoon. If there are no more questions, we can take a break.
(Inaudible) signal processing talks here, being organized today. And in this case we'll hear three persons from the (inaudible) processing center. And then, from my understanding, the audience will be (inaudible) can take a break of 15 minutes and be back here again in two hours. Thank you, all, for stopping by.

[APPLAUSE]

>> Nikolay: Thank you. Thanks.

>> Thank you, Nikolay.