>> Ivan Tashev: Well, good morning everyone, those who are present in person
in the lecture room and those who are watching us online. It's my pleasure this
morning to introduce Mark Thomas, who received his bachelor's degree in 2006
and his PhD in computer science in 2010, both from Imperial College in London.
And this morning he's going to talk about Microphone Array Signal Processing
Beyond the Beamformer. Without further ado, Mark.
>> Mark Thomas: Thank you, Ivan. Okay. So thank you for your kind
introduction, Ivan. My name is Mark Thomas, and I'm currently a post-doctoral
researcher at the Communications and Signal Processing Department at Imperial
College, London.
And I'd like to thank you for coming to my talk entitled Microphone Array Signal
Processing Beyond the Beamformer.
I'd like to welcome any questions throughout the talk, so feel free to interrupt me
at any point.
So array signal processing has been around for quite some time. In the '50s
there was quite a lot of research into phased antenna arrays for steering a beam.
But it's only relatively recently that we've started seeing multiple microphones in
consumer devices. And I'm sure I don't need to point out the RoundTable device
and the Kinect device. There are now spherical microphone arrays,
including the Eigenmike and the 64-channel VisiSonics array.
Imperial College has developed a number of planar and linear microphone arrays
for research purposes. And the motivation behind these is beamforming, which
is a very powerful technique for spatial filtering. But what I would like to talk
about in this talk is other ways of using multimicrophone measurements. So the
outline of my talk is that I will begin with an introduction showing the kind of
application scenario that I would expect to be working in and talk about some of
the notation.
Now, beamforming is a method that makes some assumptions about the way
waves propagate in space. What I would like to look at are some other kind of
models that we could impose. And the main topic of this talk is multichannel
dereverberation.
I would first like to talk about spatiotemporal averaging, which applies models of
the voice. And equalization by channel shortening, which applies a channel
model. The acoustic rake receiver, which applies a geometric model in addition to
beamforming.
And then in the second part I'll talk about geometric inferences. And what we
mean by inference is the estimation of the position of line reflectors in two
dimensions in an acoustic environment using acoustic measurements.
I will then conclude at the very end.
So the application scenario, as I'm sure is very familiar to many of you, is that we
have an acoustic enclosure. And inside this enclosure there's a talker at some
point in space. Somewhere else is a microphone or an array of microphones.
And this will receive the wanted direct path signal. Slightly after that will be early
reflections from hard surfaces like walls and tables. A bit later on, the late
reflections caused by the high-order reflections, or the reverberant tail, as it's
sometimes referred to. And there will also be some additive noise. And this
could be from acoustic noise sources like an unwanted talker or some other
noises within a room. In a domestic environment you'd expect to hear clattering
and people walking around.
And there's also the problem of additive noise coming from the measurement
apparatus. This might be within the microphone or within some code in the
[inaudible] and all of this adds to the observed signal at the microphone.
So put a bit more formally, we can say that a speech signal, S of N, is convolved
with a filtering matrix H. This produces an observation Y. Added to this is
a noise signal B to then produce an observation X.
Now, what we can do with this X is perhaps apply a beamformer. And the purpose
of this is to try to point a lobe of sensitivity at the wanted talker and attenuate the unwanted talkers
that are spatially distributed elsewhere.
Another class of dereverberation and noise reduction algorithms, of which beamforming is
part, will also aim to estimate the speech signal, perhaps by some different
means. And this I'll talk about a little bit later in the dereverberation part of the
talk.
Geometric inference aims to estimate the parameters L, which are parameters
for the line reflectors in space. And I've also added blind channel
identification. Now, although I don't talk about this explicitly in this talk, it's
nevertheless been a part of my research as a post-doctoral researcher. And
without multichannel observations and the spatial diversity that they provide, blind
channel identification can't be achieved.
The other thing that I would like to say in the introduction is some notation that I'll
use. The channel index is denoted by M and its length is L. And the observation
at channel M is XM of N.
And what is often done, and I will be doing in this talk, is to drop the channel
index and stack all of the channels into one vector. So if there's no
subscript, then it can be assumed to be a multichannel observation.
So I'd like to now move on to the first dereverberation algorithm, which is called
spatiotemporal averaging. And this relies on voice modeling. So I'd like to begin
by looking at the source filter model of speech.
And what this says is that air passing from the lungs passes through the glottis.
And as it does so, it causes the glottis to vibrate periodically. And these
excitations are then filtered by the pharyngeal, oral, and nasal cavities to produce
sounds that we interpret as voiced phonemes. What we often do is lump the
transfer function of the pharyngeal, oral, and nasal cavities into one function
called the vocal tract. And this in the Z domain is represented by V of Z and is
often modeled as a Pth order AR process. This is excited by the excitation signal
or sometimes called the error signal from the glottis, E of Z, such that the signal
that's produced, S of Z, is the product of V of Z and E of Z.
And the motivation for separating this out is very, very common in the case of
coding, where it's possible to achieve a very efficient coding by taking the V of Z
and the E of Z, parameterizing them differently and then recombining them again
at the receiver.
And it's possible to estimate these transfer functions and the excitation signal
blindly using linear predictive coding. What this does is try to minimize the
squared error between the speech signal and the predicted speech signal using
the prediction weights A. And these As are found by solving this Wiener solution
involving an autocorrelation matrix capital R, and a cross-correlation vector lower
case R.
And as I said, this is very common in the field of coding and works very well for
single-channel speech, provided that there's no reverberation.
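To make the estimation step concrete, here is a minimal Python sketch of single-channel LPC via the normal equations; the frame length, order, and function names are illustrative assumptions rather than anything from the talk.

```python
import numpy as np

def lpc(frame, order=12):
    """Estimate AR coefficients a (Wiener solution R a = r) and the prediction residual."""
    # Autocorrelation of the speech frame
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz autocorrelation matrix R and cross-correlation vector r
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    r = acf[1:order + 1]
    a = np.linalg.solve(R, r)                              # prediction weights
    # Residual e(n) = s(n) - sum_k a_k s(n-k): inverse filter [1, -a_1, ..., -a_P]
    e = np.convolve(frame, np.concatenate(([1.0], -a)))[:len(frame)]
    return a, e
```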
But what then happens if we have multiple observations in a reverberant
environment? Well, we have two options. We could try to find a multichannel
solution to the problem or perhaps more straightforwardly, we could take a single
channel linear predictive coding and apply it to the output of a beamformer.
Now, let's say we take the former case. We can come up with these
autocorrelation matrices and cross-correlation vectors for every channel M in turn
and then average them to form a more robust estimate R hat and lower case R
hat. Using these two we can then find the optimal coefficients B hat opt. And
with some analysis it can be shown, by spatial expectation, that the B hat opt
is an unbiased estimator of A opt.
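A sketch of the multichannel variant just described, under the assumption that the per-channel autocorrelation statistics are simply averaged before solving (building on the single-channel sketch above):

```python
import numpy as np

def multichannel_lpc(frames, order=12):
    """frames: list of M time-aligned frames, one per microphone.
    Average R_m and r_m over channels, then solve for one set of coefficients."""
    R_avg = np.zeros((order, order))
    r_avg = np.zeros(order)
    for x in frames:
        acf = np.correlate(x, x, mode="full")[len(x) - 1:]
        R_avg += np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
        r_avg += acf[1:order + 1]
    R_avg /= len(frames)
    r_avg /= len(frames)
    return np.linalg.solve(R_avg, r_avg)    # the spatially averaged LPC weights, B hat opt
```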
Now, if, instead, we were to opt for the second case, where we take the single
channel LPC and apply it to the output of a beamformer, it can be shown analytically
that these coefficients are no longer an unbiased estimate of the A opt for the
clean speech.
So this is a fairly straightforward example to show that it can sometimes be a bit
better to come up with a dedicated multichannel algorithm as opposed to taking a
single channel algorithm and applying it to the output of a beamformer.
So let's take a look at the --
>>: [inaudible] actually by cranking through the minimum [inaudible] error on
[inaudible].
>> Mark Thomas: Yes.
>>: Okay. It truly is the Wiener optimal solution? It is truly the optimal solution?
>> Mark Thomas: It is truly the optimum solution.
>>: Okay.
>> Mark Thomas: Yes. But only by spatial expectation. So the effect of
reverberation will of course add some error. But if you were to find the
expectation at every point in the space, then it provides an unbiased estimator.
So this analysis was done by Gaubitch, Ward, and Naylor in a paper in JASA in
2006.
So the algorithm I'd like to look at is called spatiotemporal averaging. So as
we've seen at microphone M, the output is given by XM of N. And we've seen
already that we can estimate these optimal coefficients B.
Now, what we also have to do is try to estimate an enhanced linear prediction
residual that, when resynthesized, gives us a more accurate estimate of the
speech signal.
And the motivation for this is as follows: in the top left here we have the clean
speech signal, which shows this nice periodic nature and these extra wiggles due
to the spectral shaping by the vocal tract.
Beneath it is the clean linear prediction residual. And the features that we
see are these periodic impulsive events caused by the rapid closure of the glottis.
So it sort of snaps together like a hand clap.
On the top right is the reverberant speech, in which we can still see some of this
periodic nature, but there's some additional noise caused by the presence of
reverberation. But if we then look beneath this, we see that the residual has
been masked very heavily by noise, and it's not really possible to pick out any
impulsive features.
And it's been shown in previous works that in the presence of reverberation the
effect is to place a lot more distortion on the residual than on the
AR coefficients.
So given that we know that the reverberant residual is so badly corrupted by the
reverberation, we want to do as much as we can to try to fix this. So let's just
look at a system diagram of what we've done so far.
The speech signal passes through the acoustic impulse responses which then
produces these microphone observations. We then apply the multichannel
LPC to give these optimal coefficients B.
So the next thing is to apply some spatial averaging. And we do this with a
delay-and-sum beamformer. And the aim is to take the output from the
delay-and-sum beamformer which turns a multichannel observation into a single
channel and then find a residual by inverse filtering the output of the beamformer
with these coefficients B.
So the idea of the delay-and-sum beamformer is to coherently sum the observations in
such a way that these delays tau allow the coherent sum of the wanted signal,
which is the direct-path speech, causing the attenuation of the unwanted
terms. And there are various ways of estimating this tau that I won't go into any
detail about here.
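As a rough illustration only, a delay-and-sum beamformer with integer-sample delays might look like the following; the delays are assumed known and non-negative, and fractional-delay filtering is omitted.

```python
import numpy as np

def delay_and_sum(x, delays):
    """x: (M, N) array of microphone signals; delays: per-channel advances in samples
    (assumed non-negative) that time-align the direct-path component across channels."""
    M, N = x.shape
    out = np.zeros(N)
    for m in range(M):
        d = int(round(delays[m]))
        aligned = np.roll(x[m], -d)       # advance channel m by d samples
        if d > 0:
            aligned[-d:] = 0.0            # zero the samples wrapped around by np.roll
        out += aligned
    return out / M
```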
And as I said on the previous slide, the aim is to then take this output, X bar of N,
inverse filter it with these multichannel LPC coefficients B hat to produce a
prediction residual E bar. And the way we enhance it is then to exploit the
pseudoperiodicity of the voiced speech. A few slides ago we saw that the
speech signal is very periodic. And therefore we can expect that any features
due to the voice source or this excitation signal are going to be common from one
cycle to the next. And if we then perform an intercycle averaging, we should be
able to boost those features that are caused by the voiced source in the glottis
and attenuate all those terms caused by the reverberation.
So we call this larynx cycle temporal averaging, which is applied to this DSB
residual.
So the way we do it is we consider a weighted temporal averaging over 2I
neighboring cycles. So we take a cycle and then we take I neighboring cycles on
either side of it. And I is typically in the range of two or three.
The signal is segmented into individual glottal cycles by finding the glottal closure
instants and using these to delimit one cycle from the next. And then by
performing this temporal averaging it's possible to attenuate those unwanted
terms.
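A highly simplified sketch of this larynx-synchronous averaging, assuming the GCIs are already known and cycles are simply truncated to a common length; the actual method uses a weighted (Tukey-windowed) average rather than this plain mean.

```python
import numpy as np

def cycle_average(residual, gcis, I=2):
    """Average each glottal cycle of the residual with its I neighbours on either side.
    gcis: sample indices of glottal closure instants delimiting the cycles."""
    cycles = [residual[gcis[k]:gcis[k + 1]] for k in range(len(gcis) - 1)]
    enhanced = np.copy(residual)
    for k in range(len(cycles)):
        lo, hi = max(0, k - I), min(len(cycles), k + I + 1)
        L = len(cycles[k])                               # keep the current cycle's length
        stack = [c[:L] for c in cycles[lo:hi] if len(c) >= L]
        enhanced[gcis[k]:gcis[k] + L] = np.mean(stack, axis=0)
    return enhanced
```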
But a problem arises: we don't know where these glottal closure instants
occur. Now, this is a bit of a problem. And this is where the idea of multichannel
DYPSA comes from. Now, DYPSA was an algorithm that was developed by
Imperial College shortly before I arrived. And it's a method for taking a single
channel piece of speech and estimating the glottal closure instants, those
periodic impulses within the LPC residual.
And it does this by first calculating the LP residual and then using a technique
called the phase slope function. And this is essentially a measure of energy
calculated over a sliding window. And the feature that we find with the phase
slope function is that the positive-going zero crossings locate the positions of
impulsive features within the signal. We'll see a bit more of this on the next slide.
Once we've located these features, there will be a candidate set, and within that
set are the true GCIs that we want. But there will be a lot of additional incorrect
estimates that need to be removed. And we do this by applying a dynamic
programming algorithm that I'll explain in a couple of slides' time.
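Purely as an illustration of why positive-going zero crossings mark impulsive events, the sketch below uses the offset of the frame energy centroid from the window centre as a stand-in for the average phase slope; this is an assumed proxy, not the exact DYPSA formulation.

```python
import numpy as np

def candidate_gcis(residual, win=200):
    """Illustrative phase-slope-style detector: negate the offset of the frame energy
    centroid from the window centre, then return positive-going zero crossings."""
    N = len(residual)
    half = win // 2
    centre = (win - 1) / 2.0
    psf = np.zeros(N)
    for n in range(half, N - half):
        frame = residual[n - half:n - half + win]
        e = frame ** 2
        if e.sum() > 0:
            centroid = np.dot(np.arange(win), e) / e.sum()
            psf[n] = centre - centroid        # negative before an impulse, positive after it
    return [n for n in range(1, N) if psf[n - 1] < 0 <= psf[n]]
```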
So just to give an example of what I'm talking about here, on the top is a speech
signal, which shows this clear periodic nature.
Underneath it is the linear prediction residual showing these impulsive features
and a few bits of noise in between.
And the aim is to locate these impulsive events.
Now, if we look at the phase-slope function at the bottom, we see that the
positive-going zero crossings of this function identify the location of the true
GCIs. What sometimes happens is a GCI is missed, as shown here, due to
some additional random noise in the residual. And we have techniques to deal
with this. We call it phase slope projection. Yes, question please.
>>: [inaudible] you're whispering?
>> Mark Thomas: I will discuss this later. Yes. So -- yes. So the question there
is that there are -- there are modes of speech, fricatives, whispering, unvoiced
sounds, where there is no such periodicity. And there's -- I'll talk very briefly
about how we deal with this.
Please, yes?
>>: I want to ask a question a little bit about the nature of the distortion that's
happening from the reverberation. So it seemed like when you showed the
earlier pictures and that I'm guessing it's also happening from the very noisy
residual that you showed that happened at your reverberant speech is then, you
know, essentially, you know, if you think about the wave forms it's like pieces are
being moved around a little bit because, you know, reflections from -- so the
baseline -- because reflections are coming in and affecting the magnitude, you're
basically seeing that like what looks like the start of the career has kind of been
moved over a little bit over here and moved over a little bit over there.
And I guess what I'm wondering is that -- is that if you now apply this type of
analysis, you know, to look at where the period's actually starting, you're not
going to have like a super uniform.
>> Mark Thomas: Absolutely.
>>: [inaudible] there's going to be a little bit of --
>> Mark Thomas: Yeah. I don't know if that was picked up by the microphone.
But the question was that when we add reverberation the linear prediction
residual is spoiled and as well as having these wanted peaks we get lots of
unwanted peaks, and it becomes very ambiguous as to where these occur.
So right now I'm talking about the single channel DYPSA algorithm. I'll then talk
about the multichannel algorithm that tries to deal with this problem.
>>: Oh, I see.
>> Mark Thomas: Yeah. So what we're seeing here is the clean case. There's
no reverberation. Any errors here are due to framing errors. And there
may be some additional aspiration noise in here that will manifest itself in the
residual. So, yes, we'll talk about what we do in the reverberant case in just a
minute.
But, yeah, so the reverberant case is a particular problem, especially when we
consider the erroneous zero crossings. Right now, in the clean case, this will
happen every now and then. And we don't know what zero crossing is
due to the true GCI and what is due to the erroneous GCI.
So we need to apply an algorithm that tries to get rid of these unwanted zero crossings. And what DYPSA does is apply a dynamic
programming algorithm that finds a series of costs for each of the candidates in
turn.
And for each cost, some things like waveform similarity and pitch deviation are
found. And it's expected that because of the pseudoperiodicity of voiced speech,
the pitch deviation would be very low, the closures would happen very regularly, and that the
waveform from one cycle to the next should also be very similar.
So we make costs that should be low for these terms. And the dynamic programming then finds a path
through all of the candidates within some region of support that minimizes this
cost function using some additional weights. And it just so happens that these
two, waveform similarity and pitch deviation, are the most important. And
this has been found with training data.
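A toy version of that dynamic-programming selection is sketched below; the real DYPSA cost vector has six elements with trained weights, so the two costs, the weights, and the segment length here are purely illustrative, and the path is forced to start and end at the first and last candidates for simplicity.

```python
import numpy as np

def select_gcis(cands, residual, w_pitch=0.6, w_wave=0.4, seg=100, max_skip=3):
    """Choose a subset of candidate GCIs minimising a weighted sum of
    pitch-deviation and waveform-dissimilarity costs."""
    def wave_cost(a, b):
        xa, xb = residual[a:a + seg], residual[b:b + seg]
        L = min(len(xa), len(xb))
        if L < 2:
            return 1.0
        c = np.corrcoef(xa[:L], xb[:L])[0, 1]
        return 1.0 - (0.0 if np.isnan(c) else c)    # low when neighbouring cycles look alike

    n = len(cands)
    best = np.full(n, np.inf)
    prev = np.full(n, -1, dtype=int)
    best[0] = 0.0
    for j in range(1, n):
        for i in range(max(0, j - max_skip), j):    # allow skipping a few spurious candidates
            period = cands[j] - cands[i]
            prev_period = cands[i] - cands[prev[i]] if prev[i] >= 0 else period
            pitch = abs(period - prev_period) / max(prev_period, 1)
            cost = best[i] + w_pitch * pitch + w_wave * wave_cost(cands[i], cands[j])
            if cost < best[j]:
                best[j], prev[j] = cost, i
    path, j = [], n - 1                             # backtrack from the final candidate
    while j >= 0:
        path.append(cands[j])
        j = prev[j]
    return path[::-1]
```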
So, providing we have clean speech the single channel DYPSA algorithm works
very well. And usually it will produce about 96 percent identification accuracy.
This actually formed quite a large topic within my PhD. And there are lots of
new algorithms that have superseded it and get much more
accurate estimates of the glottal closures. But we'll stick with DYPSA for now.
So going back to your question, the problem arises when we have reverberation.
And it's a particular problem because the reverberation produces all sorts of
nasty spurious peaks in the LPC residual. So in much the same way as with
multichannel LPC we have two options. We could take DYPSA and apply it to
the output of a beamformer or we could extend the algorithm in some way to
make a multichannel variant of it.
Now, let's say we do the latter. One thing we could do is try to generate a series
of candidates for every channel in turn. Now, we know that there are going to be
lots of additional spurious zero crossings due to the reverberation. But due to the
spatial diversity we should expect that only the wanted candidates are the ones that
show some kind of coherence across
channels.
So we augment the cost vector of the dynamic programming algorithm with a term that
measures the interchannel correlation between the candidates. It then penalizes
those with low interchannel correlation and encourages those with high
interchannel correlation. So we have a separate candidate generator for every
channel in turn. And we had a paper at the EUSIPCO conference in
2007 about this.
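The extra interchannel term could be sketched roughly as below, assuming the per-channel LP residuals are available; the window length and normalisation are made-up choices.

```python
import numpy as np

def interchannel_cost(cand, residuals, win=40):
    """Penalty that is low when the residual around a candidate GCI is coherent across channels.
    residuals: (M, N) array of per-channel LP residuals; cand: candidate sample index."""
    segs = residuals[:, max(0, cand - win):cand + win]
    ref = segs[0]
    corrs = []
    for m in range(1, segs.shape[0]):
        denom = np.linalg.norm(ref) * np.linalg.norm(segs[m])
        corrs.append(np.dot(ref, segs[m]) / denom if denom > 0 else 0.0)
    return 1.0 - float(np.mean(corrs))   # low cost for high interchannel correlation
```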
So if we look at what happens when we apply reverberation, in the clean case
with a large database of speech we get about a 96% identification accuracy. An
identification accuracy is defined as the number of correct identifications
divided by the total number of cycles within the voiced speech.
>>: [inaudible].
>> Mark Thomas: The ground truth for this algorithm was done with
synchronous EGG recordings. So an EGG is an electroglottograph. This
involves placing a pair of electrodes on either side of the larynx. And a recording
was then made with a head held at a specific distance from the microphone. And
then time aligned. And it's much more straightforward to pick out the glottal
closure instants from this signal because it doesn't contain any nonstructured
noise from acoustic noise sources. It doesn't have any filtering by the vocal tract.
So we don't have any errors in removing the resonances.
And this too was actually quite a large part of my PhD. And I've published quite a
few papers on GCI and GOI (glottal opening instant) detection from voiced speech.
>>: You said EEG?
>> Mark Thomas: The EGG. The electroglottograph. Or electroglottogram.
>>: [inaudible] what you get?
>> Mark Thomas: Right.
>>: So [inaudible].
>> Mark Thomas: Well, it's a measure of the conductance. So across the
electrodes is an RF source of about one megahertz.
>>: [inaudible].
>> Mark Thomas: It's --
>>: [inaudible] is measuring the [inaudible].
>> Mark Thomas: Yes.
>>: Okay. Thanks.
>> Mark Thomas: The conductance. And the reason why this changes is that if
we're measuring the cross-section of the glottis, it closes in this zipper-like
fashion, and as it comes close together there's a lot higher
conductance, and a lot less when it opens. And if anyone's interested about this later on, then I
can show some examples and how we deal with it.
Yes, so back to the performance. We've got this ground truth that we form from the
EGG estimate. In the clean case we get about 96% accuracy. If we then apply
the single channel algorithm to the output of a beamformer -- sorry, apply the
single channel algorithm to one of the observations, we see that the
identification accuracy drops very, very quickly. And if we observe as well the
time alignment of each identification, that also gets worse. But I haven't shown
an example here.
So one of the possibilities was to take a single channel algorithm and apply it to
the output of a beamformer. And this is shown in yellow and not surprisingly it
does a bit better because the beamformer has helped to attenuate the
reverberant components.
If we then take the same dataset and apply it to the multichannel algorithm it
does better still. So we have another example of how sometimes it's better to
come up with a multichannel variant of an algorithm rather than taking the single
channel algorithm and applying it to the output of a beamformer.
Another plug for a paper is -- has just been accepted on a number of different
algorithms for glottal closure detection from speech signals in lots of different
noise and reverberation environments.
So that was a bit of a sideline on DYPSA. Now, let's get back to the
spatiotemporal averaging. We have everything we need. We've got our
observations. We have an algorithm that finds an optimal set of LP coefficients
to give us residual. We apply that to a delay-and-sum beamformer to get a DSB
residual. We now know where the glottal closure instants occur and are able to
perform this larynx synchronous temporal averaging.
And then this gives us the following. On the top is the clean residual, showing
this nice periodic train of impulses.
In the middle is the reverberant residual which is not quite as bad in this case as
we saw earlier. There are some peaks in there, but some of them are buried in
noise.
Then having performed the spatiotemporal averaging we get this enhanced
residual, which looks much more like the residual in the clean case.
>>: I have a question about the state vector of your dynamic programming
algorithm. Is it -- is it just the location of the glottal closures or is it like the entire
vector of the period. Like what's the actual [inaudible].
>> Mark Thomas: It's -- the internal state is based on the absolute timing of the --
>>: I see.
>> Mark Thomas: Of the GCI.
>>: [inaudible] end up with the set of time --
>> Mark Thomas: So we have a -- yeah. A set of time instants. And as you
pass through time, the instants of glottal closure from the candidates are found.
And with this is associated those six cost vectors. So the vector of six cost
elements. And then it may jump some candidates that are incorrect and then find
this path that, within some region of support, maximizes the likelihood and
therefore minimizes the costs.
There is a trade-off. If you were to take lots and lots of cycles, it becomes quite
computationally intractable and there comes a point where there are too many
different possible paths to choose. What can also happen is if you take lots and
lots of cycles, the pitch period will start changing slightly and if it's different at one
end or the other, then the pitch consistency cost is not going to give you a
reliable estimate. So there is this trade-off. If you make it too short then there
comes a point when you're not benefitting from these similarity features. So --
>>: Isn't [inaudible] from frame to frame? Like presumably, like a standard DP
setup, you have all these candidates at each time step and then you have the
transition costs between them. So the pitch transition cost would be one that
from one state to the next would increase or decrease based on
whether they were consistent or not, right? But if it changed slowly
through a long period, that shouldn't be so bad, should it?
>> Mark Thomas: I'm trying to think how we achieve this. So, yeah, the question
is that if we take lots and lots of cycles then the pitch consistency will change
slowly. And it shouldn't matter from one end to the next.
Now --
>>: We can talk later offline.
>> Mark Thomas: Yeah. I actually forget what -- exactly how we approach this.
I forget whether it's done on a per-cycle basis or whether it's the variation of the pitch period compared with the overall pitch period. If it's the
latter case, which I think it might be, then there's a problem. Because you will
start seeing more deviation. But maybe we can talk about this later.
>>: [inaudible] some sort of larynx center filter. But these are not exactly the
same. So how do you get from reverberant residual to enhanced residual?
>> Mark Thomas: So the reverberant residual first passes through the spatial
filtering with the beamformer. And that will then help to enhance these ever so
slightly.
The larynx cycle temporal averaging is weighted using a Tukey window. So
what this does is prevent it from performing any averaging at the glottal closure
instant -- or it's weighted in such a way that there's less averaging at the GCI. But
there's lots more averaging happening between them. And that's -- I'll go back a
few slides.
>>: So you essentially designed a sort of on the fly a filter that has sort of a -- I
guess it's a spike at the frequency of the glottal impulses. And then is it -- is a stop
band everywhere else, as much as you could. And then you apply it at the right
phase to --
>> Mark Thomas: Yeah. We don't -- we don't think of it so much in the
frequency domain. This is all a time domain averaging. So there's no sort of
pass band and stop band at this point.
>>: [inaudible].
>> Mark Thomas: But, yeah, it's effectively a linear filter. So, yes, you could look
at it in those terms. Actually this brings us very neatly on to what happens in the
case of unvoiced speech. So thank you for pointing that out. And so if I just go
back to where we were. Yeah.
So we have the enhanced residual. So this is dealt with what happens in the
case of voiced speech. Now, there are times in speech that don't have any kind
of periodicity: unvoiced fricatives and so forth, and indeed silence,
don't have this periodicity. So we don't really want to apply this temporal
averaging during these periods. The spatial averaging is okay. But not so much
the temporal averaging.
So that brings on the very last part of this that involves a voiced, unvoiced, and
silence classification. I didn't actually put any slides on this. But the idea is that
we only perform the larynx cycle temporal averaging during the voiced speech.
And at the same time, we make an adaptive filter which we call G, which looks at
the difference between the output of the DSB and the enhanced residual and then
forms a filter that effectively achieves the same as the larynx cycle temporal
averaging.
This is updated during voiced speech, and it's done slowly on every cycle. And
then during unvoiced speech, this filter is then applied instead of the temporal
averaging. So this helps to give a little bit more robustness during the unvoiced
speech. Although I don't have any slides on this, I -- if you -- if we have some
time afterwards I can walk through the way in which we form this adaptive filter.
>>: [inaudible] adaptive trying to get what the larynx cycle temporal averaging is
attempting to do --
>> Mark Thomas: Precisely.
>>: In time? In space? Both?
>> Mark Thomas: Well, both.
>>: One channel?
>> Mark Thomas: Well, it's actually in time because it's driven by the output of
the DSB. So it's only -- it's doing the same thing that the temporal averaging is
doing but not by temporal averaging but instead by an adaptive filter. So it's a
convolutive process that mimics this. I can -- it's -- unfortunately I don't have any
information on this at the moment, but perhaps if we get some time at the end I
can exit this and I can look a little bit closer at the approach that we use. Yes.
But let's say that this works. We have then found an enhanced residual E hat of
N. We can then resynthesize the speech using these multichannel LP
coefficients to give an estimated speech signal S hat of N.
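The resynthesis step is just the all-pole filter driven by the enhanced residual; per frame it amounts to something like the following (SciPy assumed, framing and overlap-add omitted).

```python
import numpy as np
from scipy.signal import lfilter

def resynthesize(e_hat, b_hat):
    """Pass the enhanced residual through 1/B(z), where b_hat are the multichannel LPC weights."""
    a = np.concatenate(([1.0], -np.asarray(b_hat)))   # denominator [1, -b_1, ..., -b_P]
    return lfilter([1.0], a, e_hat)                   # s_hat(n)
```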
And we evaluated this by taking a real room with a T60 of about 300
milliseconds. We placed an eight-microphone linear array with five-centimeter
spacing between the elements. And varied the source-array distance from half a
meter to two meters.
The performance was then measured using the segmental SNR, which is a
means of estimating the SNR on a frame-by-frame basis, and the bark spectral
distortion score. With the segmental SNR, higher is better. With the bark
spectral distortion, lower is better. And what we see is that if we look at one of the observed channels, in this case channel one, we get the worst
segmental SNR and the worst bark spectral distortion. A bit better than that is
the delay-and-sum beamformer, which is performing a spatial averaging only. And
then better than that is the spatiotemporal averaging, which is doing both the
spatial averaging of the beamformer and the temporal averaging over larynx
cycles.
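Segmental SNR as used here is roughly the following frame-based average; the frame length and the clamping values are typical choices, not necessarily those used in the evaluation.

```python
import numpy as np

def segmental_snr(clean, estimate, frame=256, floor=-10.0, ceil=35.0):
    """Average per-frame SNR in dB between the clean and processed speech."""
    snrs = []
    for start in range(0, len(clean) - frame, frame):
        s = clean[start:start + frame]
        e = s - estimate[start:start + frame]
        if np.sum(s ** 2) > 0 and np.sum(e ** 2) > 0:
            snr = 10.0 * np.log10(np.sum(s ** 2) / np.sum(e ** 2))
            snrs.append(np.clip(snr, floor, ceil))    # clamp silent or near-perfect frames
    return float(np.mean(snrs))
```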
So I'd like to play some example sounds that were recorded. So the first is the
output of microphone one.
>>: George made the girl measure a good blue --
>> Mark Thomas: It always cuts off at that point. So again.
>>: George made the girl measure a good blue vase.
>> Mark Thomas: And then the output from the delay-and-sum beamformer.
>>: George made the girl measure a good blue vase.
>> Mark Thomas: So perceptually it sounds like the microphone is closer, but
there's some evidence of the reverberant tail in the background.
If we then apply the spatiotemporal averaging.
>>: George made the girl measure a good blue vase.
>> Mark Thomas: Hopefully it should sound like the components of
reverberation have been reduced. But we're in a fairly reverberant room here,
and I think maybe to appreciate this properly you need to listen on headphones.
>>: [inaudible] play the last two again.
>> Mark Thomas: Yes, please.
>>: What is the distance between the speaker and the microphones in the
examples?
>> Mark Thomas: In this example, it's two meters. Yeah.
>>: George made the girl measure a good blue vase. George made the girl
measure a good blue vase.
>> Mark Thomas: So it's done its job. There are some artifacts. This is a fairly
good example. Maybe if we have time at the end, I can play some bad examples
where what tends to go wrong is the glottal closure instants are identified
incorrectly and then the wrong amount of temporal averaging is applied. And
that then produces the musical noise.
So I would like to summarize this section by saying that multichannel LPC
decomposes a reverberant speech signal into an excitation signal and a vocal
tract filter and that reverberation has a very significant effect upon the LP
residual.
Multichannel DYPSA then segments the LP residual into glottal cycles, allowing
us to form an enhanced LP residual estimate by this technique called
spatiotemporal averaging.
The speech is then resynthesized with the LP coefficients which is what we just
heard. And the result of this is that the dereverberation performance of the
spatiotemporal averaging is better than performing the spatial averaging alone.
And there's a paper that we published on this at WASPAA in 2007. Yes, please.
>>: So see -- it seems like you use a lot of knowledge of speech signal to -- as
predicting what happens at excitation to do [inaudible]. And to do that, you sort
of have to do a lot of the speech production analysis stuff.
So there's some other techniques where you take the same basic approach but
it's a little bit more -- it's more a question of changing objective function like for
example let's say, you know, only [inaudible] come up with some [inaudible] filter
so that the residual is peakier than it is for [inaudible] or minimizing how
[inaudible] do this.
So do you have any sense how -- you know, obviously [inaudible] very good. So
do you have any sense -- does this -- do you get something out of the fact that
you're exploiting a more probability speech knowledge or how this was prepared
and things like that?
>> Mark Thomas: Okay. So the question is that we at the moment exploit the
speech signal, and we know lots of things about how the periodicity of the
speech signal varies over time. And that there are other techniques that do
things like maximizing the kurtosis of the LP residual. Although we've not made
any direct comparisons between them, what we have done in the past is use the
costs of the -- of DYPSA to give us an estimation of the reliability of the voicing.
And rather than applying this technique exclusively to those periods in which
we've found GCIs, we apply it only to those periods where we're pretty confident
we can find it.
Then during those times when we aren't confident in the GCIs then we want to
apply something else. And that something else could include something like
kurtosis maximization and such forth so, yeah, I think there are lots of interesting
extensions that can be done here based on how reliable the estimation of voicing
is and using alternative techniques to try to make this robust.
Does that -- that's about all I can say on that. But thank you for the question. It's
a good question.
Okay. So moving on. The next algorithm I'd like to talk about is very different.
And this uses not voice modeling but channel modeling. And I would like to talk
specifically about an approach called channel shortening, which is quite
popular in the field of RF but not so common in the field of acoustics.
Now, the aim of acoustic channel equalization is to use an estimate of the
acoustic system H and remove it by some means. Unfortunately, although there
are some existing optimal techniques, the effect of noise
and estimation error in finding this H has a very significant effect upon the
quality of the equalization. And it really then limits how applicable it is to the real
world scenario.
But what I would like to talk about here is a technique that we've been working on
called channel shortening and see how this helps to increase the robustness. So
let's jump straight into a system diagram. This is very similar to the previous
case when we have a speech signal that passes through this convolutive
process H with some noise to produce the observations X.
And the aim is to find some filters G that when convolved with the observations
and then summed over all channels produces an estimate of the speech signal.
And we find these Gs using an equalization algorithm. And this is driven by a
system identification algorithm that aims to estimate these signals -- these
impulse responses H.
So we can formulate the problem as taking a channel H and convolving it with an
equalizer G that when summed over all channels produces a target function.
And this target function is usually an impulse or a delayed impulse where there's
just one nonzero tap.
This can be represented in matrix form, where we have a filtering matrix H and
the coefficients G that we want to find that then produce this desired response B.
And the ultimate aim is to use these filters to form a filtering matrix, a convolution
matrix that when applied to the observations gives us an estimate of the speech
signal.
Now, one way of doing this is to use the MINT algorithm, the multiple-input/output
inverse theorem. And this can be formulated as a multichannel least squares
problem that aims to minimize the difference between the equalized response
and the desired response. And this can be found fairly straightforwardly using
the Moore-Penrose pseudo-inverse or some other kind of pseudoinverse of the
filtering matrix multiplied by the desired signal. This thing gives us the G hat.
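In code, that least-squares formulation could be sketched as below, with one convolution matrix per channel stacked side by side; the equalizer length Li and delay tau are assumed to satisfy the MINT length criterion.

```python
import numpy as np

def conv_matrix(h, Li):
    """(L + Li - 1) x Li convolution matrix of channel h."""
    L = len(h)
    H = np.zeros((L + Li - 1, Li))
    for i in range(Li):
        H[i:i + L, i] = h
    return H

def mint(channels, Li, tau=0):
    """Least-squares multichannel equalizer: stack per-channel convolution matrices
    side by side and match a (possibly delayed) unit impulse."""
    H = np.hstack([conv_matrix(h, Li) for h in channels])
    d = np.zeros(H.shape[0])
    d[tau] = 1.0                                      # target: delta delayed by tau samples
    g, *_ = np.linalg.lstsq(H, d, rcond=None)         # Moore-Penrose pseudo-inverse solution
    return np.split(g, len(channels))                 # one equalizer per channel
```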
And this can provide exact solutions. It can give an exact inverse providing there
are no common zeros in the acoustic impulse responses H.
So if we were to take the impulse response from the source to one of the
microphones and factorize it, no zeros should be in common between the channels. If
there are common zeros, this produces an ambiguity because we don't know
whether a zero in the received signal is due to the source or due to
the acoustic impulse response. So it's important then that there are no
common zeros.
The other criteria are that we have to have at least two channels and that the equalizer
has to fulfill some length criterion. Now, all three of these are possible. The
problem comes in that the filters H used to estimate the inverse filters G must
contain no errors. Now, for a practical scenario, this can't possibly be the case.
There will always be some degree of error in the measurement apparatus or the
system identification approach.
So we could relax the problem. It may not be necessary to equalize the signal
entirely to produce a delta function. What we could do is appeal to a branch of
psychoacoustics that says that the ear cannot hear low-order reflections. So
those low-order reflections from walls and tables and so forth are perceived not so
much as reverberation but perhaps as coloration. This has been shown
subjectively not to be so detrimental to the intelligibility of the speech.
So we have in the middle here a stylized diagram that shows the direct path
impulse response which is what we would really like to get at, the low order
reflections which then combine into lots of high-order reflections that produce this
reverberant tail.
If we then make a weighting function that incorporates a relaxation window, as we
call it, of length LR, we can say, okay, these low-order reflections, let's not bother to try to
remove them. Let's just let the equalizer do whatever it likes at that point
and only try to attenuate the reverberant tail.
For this we use an algorithm that we call the relaxed multichannel least
squares (RMCLS) approach. And this involves a modified cost function that contains this
weighting matrix W that has zero entries within the relaxation window, and
therefore, by finding the minimum squared error solution, those taps are
completely unconstrained.
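A sketch of the relaxed cost, reusing the conv_matrix helper from the MINT sketch above: rows inside the relaxation window simply get zero weight so those taps are left unconstrained (the placement of the window directly after the target tap is a simplification here).

```python
import numpy as np

def conv_matrix(h, Li):
    """(L + Li - 1) x Li convolution matrix of channel h (same helper as in the MINT sketch)."""
    H = np.zeros((len(h) + Li - 1, Li))
    for i in range(Li):
        H[i:i + len(h), i] = h
    return H

def rmcls(channels, Li, tau, relax_len):
    """Relaxed multichannel least squares: don't-care region of relax_len taps after the target."""
    H = np.hstack([conv_matrix(h, Li) for h in channels])
    n_rows = H.shape[0]
    d = np.zeros(n_rows)
    d[tau] = 1.0
    w = np.ones(n_rows)
    w[tau + 1:tau + 1 + relax_len] = 0.0              # W has zero entries inside the relaxation window
    g, *_ = np.linalg.lstsq(H * w[:, None], w * d, rcond=None)
    return np.split(g, len(channels))
```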
Now, what --
>>: [inaudible] the previous slide. [inaudible] you care about the tail?
>> Mark Thomas: Yes.
>>: So just that you want to get rid of the dependence in the middle part?
>> Mark Thomas: Yeah.
>>: Then what's the head? I'm kind of confused?
>> Mark Thomas: I'm sorry. What's the problem, sir?
>>: So what are the rows -- what are the rows of the error function? Go back.
You're saying that -- in terms of -- there's a column -- there's an
error, an error vector. It's a column that's H G minus D -- H tilde G minus D is the
error, right?
>> Mark Thomas: That's right.
>>: So what does each row in that error vector, in that residual represent?
>> Mark Thomas: That represents the magnitude -- so the amplitude of any one
of the resulting taps from the equalization. So G is a vector that is then
convolved with H -- using this convolution matrix H. And this then
produces a vector which is the response. So if you were to then apply this to a
speech signal ideally that would just be a delta function or a delayed delta
function. So all you would get out from the reverberant speech is the
non-reverberant speech. So ideally we just want a delta.
>>: So why do you have [inaudible] that says yes, I want a -- I want a -- yes, I
want zero -- yes, I want one, yes, I want zero, I want zero, I want zero, I don't
care, I don't care, I don't care, I don't care, I want zero, I want zero, I want zero,
and I want zero, right? So why is it -- why is there a tau, why [inaudible] equal
one?
>> Mark Thomas: Okay. So the reason why tau is not equal to one is because
we are working with finite lengths of impulse response. Now, let's -- let's say
that we were to take the single channel case. It's not possible to form a
complete inverse. It's only possible to form a least squares inverse. And if we
have a finite number of taps, and that number of taps is in the order of the
lengths of the channel, it forms very tight constraints on the values of the taps.
And the result is getting some very, very large taps that are far too
large to be used for any kind of practical system.
If we increase the length, say we were to increase the length of the equalizer to
infinity, then the magnitude of those taps is no longer so great.
So the -- you're always forced by this length criteria. Now, one way of getting
round the -- this magnitude, this tap amplitude problem is to find a target function
that is not a delta but a delayed delta. And this then relaxes the causality
constraints. So let's say that we've got a direct path signal and then a reflected
signal, that reflected gallon is always going to come later on. And what we really
want to do is take that reflected signal and sort of time align it with the direct
path. This implies semantic causality.
Now, if we have a target function that is a delta with no taps before it, that
[inaudible] can't happen. And the only way it can get round it is to use what little
bit of information is in other taps to try to do whatever it can to get this target.
>>: So it's really I want zero, I want zero, I want zero, I want zero, I want one, I
don't care, I don't care, I don't care, I don't care.
>> Mark Thomas: Precisely.
>>: Okay.
>> Mark Thomas: Maybe I should have explained in a little bit more detail about
what this tau is meant for.
There was a paper actually recently by an author whom I forget that looks at
what happens with the filter gain. So this is the L2 norm of the filter. And looks
at the gain as a function of this tau. And it shows that the robustness can be
improved by increasing the tau and introducing this non-causality problem. And
by increasing this tau we can reduce the gain and then make it more robust.
This RMCLS algorithm does it in two ways. It can exploit the tau. It can also
exploit the fact that we don't care what these taps are, so it can then take
whatever value it likes.
And if we then take some Monte Carlo simulations where we've just taken the
room and made some random locations of source and receiver and then look at
the mean filter gain, we see that when there is no relaxation window at all, which
is equivalent to the MINT approach, the gains are massive. They can be in
excess of 50 dB. And for such a large amount of gain, any tiny amount of noise
or system error is going to manifest itself as a huge error in the output.
So while this could work very nicely in the no error, no noise environment, as
soon as you come out with a practical scenario, this is not suitable.
But by having this relaxation and allowing those taps within the relaxation window
to take whatever value they like, the gain is significantly reduced. And even for a
relaxation window of 50 milliseconds, perceptually this is actually not too
bad. But it allows us to reduce this gain a long, long way and therefore make it
much more robust to real world deployment.
I'll give you an example now of what happens when we introduce a little bit of
error. So we simulated a room of 10 by 10 by 3 meters with a T60 of 600
milliseconds and placed two microphones within there.
We forget about noise for now, but instead we add Gaussian errors to the
channels H to yield a mismatch between H and the perturbed system of minus
40 dB. This is a very small amount of error.
On the top left is channel one. And here we can just about make out these
small taps due to the low-order reflections. And this is then followed by the
reverberant tail.
The second is the equalized impulse response on the output of the MINT
algorithm. And although we can't see it here at the zero tap, so here tau is zero,
there is a value of one or very close to it, and this is followed by lots of very small
taps. So it's done what it should do by suppressing those taps.
In the third and fourth case are two examples of channel shortening, RMCLS and
another algorithm that is based on maximizing the Rayleigh quotient. I won't
talk in any great detail about that now.
But in both cases, the relaxation window has been set to 50 milliseconds. And
we can see that these taps are taking any nonzero value they like followed by a
very much more attenuated tail.
Now, if we then look at the energy decay curve, which is a measure of how the
energy in the impulse response decays -- and, in the last three cases, the
equalized impulse response -- we see that in blue is the decay of the first channel.
The MINT output is shown in red. Now, it's a bit deceptive because in B it looks
like there are lots of very small taps compared with the direct path. But actually if
you look at the energy decay curve we see that it has actually increased the level
of reverberation. So rather than performing dereverberation it's just replaced it
with another kind of reverberation. So it's failed in this case to achieve anything
useful. And I'd like to remind you that this is actually with minus 40 dB error and
no noise. So it's a really [inaudible] scenario and still it's gone wrong.
But if we then look at the RMCLS and other channel shortening approach, we
see that there's a rapid decay within the first 50 milliseconds and after that
point the energy level is very much below that of the original channel.
So in this case, the channel shortening approaches have performed
dereverberation.
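For reference, the energy decay curve used in these plots is just the backwards-integrated energy of the (equalized) impulse response expressed in dB, e.g.:

```python
import numpy as np

def energy_decay_curve(h):
    """Schroeder-style backward integration of an impulse response, in dB."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]            # remaining energy from each tap onwards
    return 10.0 * np.log10(np.maximum(energy, 1e-12) / energy[0])
```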
>>: [inaudible].
>> Mark Thomas: Yes. In this case, it's equal to one, yes. And we mention this
in a paper that I think I'll plug in a minute for the forthcoming WASPAA
conference. It gives some more experimental results on this and looks at what
happens when we take lots of different scenarios with varying SNR and varying
system mismatch. And there are audio examples that I'm going to show in a
minute.
>>: [inaudible].
>> Mark Thomas: This is synthetic data at the moment, yes.
>>: [inaudible].
>> Mark Thomas: With the image method, yeah. But we can take the real world
scenario. So we took a real room of five by three by three meters with a T60 of
just under 200 milliseconds. So this is like a typical office environment. We then
performed a supervised system identification for two channels. This was done
using the maximum length sequence approach. And it's using the same setups.
We had a pair of microphones and a loudspeaker. Using the same setup as we
did to estimate the impulse response we then played some speech through the
loudspeaker, recorded it with the microphones, and then attempted to use these
algorithms to perform dereverberation on the recorded signal.
Now, there is always going to be some degree of error in the system
identification because of noise and finite length errors and all that sort of thing.
There is also some amount of noise within the room due to the air ducting. And
there were people there sort of shuffling around.
And just as an additional objective measure, we used the PESQ approach.
PESQ is an algorithm that gives an estimated predicted mean opinion score, just
as a sort of rough-and-ready approach to give an objective measure as to how well
this has worked.
So I'd like to play the clean speech signal.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: And then the recorded signal.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: Unfortunately, once again, with this room being a bit
reverberant, it's not very easy to tell. And on headphones you can tell the
difference. But what should be noticeable is when we take this recording
and then apply the MINT algorithm.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: So the reverberation is certainly increased. And if you listen
on headphones, there's a lot more noise in the background. So it hasn't
performed any kind of dereverberation.
And I'll just play one more with a channel shortening of 12 and a half
milliseconds, which is pretty short. And this yields a perceptual MOS score
of 3.1.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: And if we compare that again with the recorded.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: I hope.
>>: [inaudible] clean speech?
>> Mark Thomas: Yes, clean speech.
>>: In language, infinitely many words can be written with a small set of letters.
>>: So remind me again, LR is the amount of the delay?
>> Mark Thomas: So L --
>>: [inaudible].
>> Mark Thomas: Okay. So LR is not delay. LR is the length of the relaxation
window. It's the number of samples at the beginning of the target impulse
response in which we say take whatever value you like. Don't try to attenuate it
to zero, just take it to whatever value.
What happens is when you increase the LR a bit further, some degree of spectral
distortion is introduced because the low order taps are then being included in the
[inaudible] this is spectral distortion. So maybe you'd like to listen to that as well.
>>: Language infinitely many words can be written with a small set of letters.
>> Mark Thomas: We missed the beginning. Let's play that again.
>>: In language infinitely many words can be written with a small set of letters.
>> Mark Thomas: And then compare that with the best case.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: There's a small spectral tilt. There's a bit of high frequency
information missing when we make the relaxation window too long.
>>: [inaudible].
>> Mark Thomas: And there are some more results in the forthcoming WASPAA
paper on this. And if anyone's interested, then we can talk about those a bit
later.
>>: [inaudible].
>> Mark Thomas: In this case, it is the optimal. The optimal will vary depending
on the room and the amount of mismatch in the system and the amount of noise.
And the -- can I show you a slide? Ah, yes. So this was a hidden slide. This
shows the case where we took a simulated room and performed some Monte
Carlo simulations where we had one case that only considers noise and one that only
considers channel error. And the noise was varied between 10 dB SNR and an
infinite SNR, and the channel error was changed between minus 10 dB and minus
infinity. So the minus infinity channel error is the best case.
And I think we can just about make out here on the bottom the dashed line. And
this dashed line is the unequalized response. And there are other dashed lines to
show the unequalized response for the different noise
scenarios.
On the X axis is the length of relaxation and on the Y axis is the perceptual MOS
score. And what we see is that provided the SNR is in excess of about 30 dB,
which I think is about reasonable for an office environment, then you can benefit
from introducing the channel shortening. If you do no channel shortening at all
then this drops very, very much. And most of the time you're actually below the
dashed line. So there's no point in performing the MINT algorithm on those
cases. Only in the case of no noise can the channel shortening -- sorry, the
MINT algorithm produce the best score. And here four and a half is the highest
score which says that there is no error in the equalized signal.
Interestingly, on the right-hand side is the case where we consider no channel
error and the dashed line is therefore in the same place all the time because
there's no noise.
We see that actually the algorithm is more sensitive to a certain number of dB
of additive noise than it is to a certain number of dB of channel error. So while
there are lots of papers that look at channel error, the real killer here is actually
the amount of noise that's been added not the channel error itself. And it also
shows that in every case, if we introduce a small amount of channel shortening,
then there is some empirical optimum length, and it seems to vary somewhere
between naught and 20 milliseconds. So on the whole I think in every case if you
pick a channel shortening length of about 10 milliseconds then that's in the area
of this empirical optimum. But some more analysis needs to be done on this to
find an analytic expression for the expected performance as a function of noise,
as a function of channel error, and as a function of the shortening length.
So if there are no further questions on this section, I would like to summarize by
saying that perfect dereverberation can be achieved by the MINT algorithm,
provided the channel is known with no error and provided there's no noise. If we
introduce an error or noise then this is very detrimental to performance. And
we can approach the problem by using the channel shortening technique that
appeals to psychoacoustics and says that the ear cannot distinguish low order
reflections and therefore we aim to equalize only the reverberant tail and to leave
those low order reflections alone.
This achieves improved robustness because the gain is reduced. And it then
allows it to be applied to real world recordings. So the final algorithm I'd like to
talk about in dereverberation applies geometric modeling in addition to the
propagation modeling that is applied in beamforming.
Now, in overview, the idea of a beamformer is to coherently sum or filter and sum
or sum and filter the direct-path component of the multi-microphone
observations. It's this direct path that we want to get at.
If we then consider the image model of a room, we can say that if there is this
room, which is shown in solid black, and then inside it is the receiver or an array of
receivers, shown by a circle, and a source shown in green by the X, then the reflections
produce image sources outside the room; in red here are the first-order image
sources. In this 2D case there are four, and if you take an extension to 3D where
we consider the floor and ceiling, there will be six first-order image sources in this
shoebox environment.
There are also high order reflections where reflections have been reflected and
this produces image sources that are further away, and this produces the
reverberant tail that we talked about in the previous section.
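For the 2D case described here, the four first-order image source positions can be written down directly; a minimal sketch, assuming an axis-aligned room with one corner at the origin:

```python
def first_order_images(src, room):
    """src = (sx, sy) inside an axis-aligned room of size room = (Lx, Ly) with a corner at the origin.
    Returns the four first-order image source positions (one reflection per wall)."""
    sx, sy = src
    Lx, Ly = room
    return [(-sx, sy),            # reflection in the wall x = 0
            (2 * Lx - sx, sy),    # reflection in the wall x = Lx
            (sx, -sy),            # reflection in the wall y = 0
            (sx, 2 * Ly - sy)]    # reflection in the wall y = Ly
```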
So if we can identify where these image sources are, is it then possible to use
them to our advantage? Can we actually then perform some kind of coherent
averaging of not just the direct path signal but the signal from the low-order
reflections? And this approach is called a rake receiver. A rake receiver is a
technique used commonly in communications where reflections from hills and tall
buildings are time-aligned and then coherently summed with the direct path
signal.
What we want to do here is apply the same approach but using acoustics. Now,
this section's a little bit shorter because this is a preliminary investigation, so I
won't go into quite so much detail. But the idea is that if we take this stylized
image of a single channel, there is the large direct-path tap, there are some small
sparse taps caused by the low-order reflections, and this is then followed by the
reverberant tail.
If we take the single channel and then align it with the first order reflections and
then sum, it should then be possible to reinforce the direct path with this sort of
pseudodirect path caused by the image sources. And here we've shown the
single channel case, but it should then be possible to then take the multichannel
case and align the direct path and first order taps for all of the channels.
So let's just look at what a delay-and-sum beamformer does. In this case, we're
looking at it in the frequency domain.
The steering filters can be thought of as a complex exponential, including a term
tau_m, where tau compensates for the propagation delay from the source
to microphone m. And there are some weights which just preserve
some amplitude criteria.
The response is given by the product and sum of the steering weights, which include
this complex exponential, and the transfer function from the source to microphone
m. And when summed over all microphones, we then get the transfer function of the
delay-and-sum beamformer.
And the idea of this complex exponential is to cancel the complex
exponential caused by the delay in propagation. So all we're left with then is
not a delay but just some amplitude term.
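A minimal reconstruction of the expressions being described here, with symbol names assumed from context rather than taken from the slide:

\[
W_m(\omega) = a_m\, e^{j\omega\tau_m}, \qquad
H_{\mathrm{DSB}}(\omega) = \sum_{m=1}^{M} W_m(\omega)\, H_m(\omega),
\]

where H_m(ω) is the transfer function from the source to microphone m and a_m is an amplitude weight. If the direct-path part of H_m(ω) behaves like A_m e^{-jωτ_m}, the exponentials cancel and only an amplitude term a_m A_m remains.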
But at the moment, these taus are a function of m alone. If we consider the rake
case, then we need to think about more sources, which we'll index by lower-case r.
If we limit ourselves to the first-order reflections, the total number of sources, capital R,
is seven, because there's one source within the room and then six image sources
outside the room.
So tau then becomes a function of both r and m. And the response is very
similar, except it includes this additional sum over all channels and all
sources.
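Again as a sketch under the same assumed notation, the rake version simply extends the sum over the R real and image sources:

\[
H_{\mathrm{RDSB}}(\omega) = \sum_{r=1}^{R} \sum_{m=1}^{M} a_{r,m}\, e^{j\omega\tau_{r,m}}\, H_m(\omega),
\]

where τ_{r,m} is the propagation delay from source r (the true source or one of its images) to microphone m.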
So it's natural then to ask what the expected performance of this rake
delay-and-sum beamformer is. Well, one means of doing this is to use the
direct-to-reverberant ratio, which is a measure of the ratio of the energy in the direct
path to the energy in the reverberation. And this is very closely related to the
directivity index, which is sometimes used in the context of beamforming.
Now, it's fairly straightforward to find the direct-path amplitude, as we can use a
3D Green's function for this, where here we've replaced the taus with the distances d
and the omega with a wave number k. The product kd is
actually equal to the product of omega and tau.
There's also a term for the reflection coefficient, which is one for the direct-path
signal and less than one for signals emanating from the image sources
outside the room.
There's a slight simplification in here, in that we're assuming that this beta is not
frequency dependent, but in the real-world scenario there will be some frequency
dependence in the reflections. We don't consider this right now.
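As a sketch of the Green's function term just described, with the notation assumed from context:

\[
G_{r,m}(\omega) = \frac{\beta_r}{4\pi d_{r,m}}\, e^{-jk d_{r,m}}, \qquad k\, d_{r,m} = \omega\, \tau_{r,m},
\]

where d_{r,m} is the distance from source r to microphone m, and β_r is the reflection coefficient: one for the true source and less than one for the image sources. The DRR is then the ratio of direct-path energy to reverberant energy at the beamformer output, usually expressed in decibels.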
For the reverberant component it's a little bit more tricky, as we don't necessarily
have an expression for the expected value of the reverberation. But what we can
do is appeal to statistical room acoustics, which allows us to find the expected value of
the cross-correlation between one microphone and a neighboring microphone.
And this is a function of the area of the walls, the average absorption coefficient,
and the spatial location of the receiver relative to another receiver.
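For reference, the statistical room acoustics result being appealed to is usually written in a form along these lines (up to the exact constant):

\[
E\!\left\{ H^{(\mathrm{rev})}_m(\omega)\, H^{(\mathrm{rev})*}_n(\omega) \right\}
= \frac{1-\bar{\alpha}}{\pi A \bar{\alpha}}\,
\frac{\sin\!\big(k\,\|\mathbf{r}_m-\mathbf{r}_n\|\big)}{k\,\|\mathbf{r}_m-\mathbf{r}_n\|},
\]

where A is the total surface area of the walls, ᾱ is the average absorption coefficient, and r_m, r_n are the positions of the two receivers.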
And now that we have this expression for the expected value of the
cross-correlation and we have an expression for the expected DRR, we can work this
through, and it comes out as this rather nasty-looking expression. But
there is some validity in this, as has been shown by some experimental results.
And what we did was take a room of four by five by 6.4 meters, simulated
using the source-image method with a T60 of half a second. So it's quite a
large, reverberant room where the field is quite diffuse. And it's this diffuseness
that the statistical room acoustics relies upon.
Within the room, we placed a linear array of M microphones spaced by 20
centimeters, and then ran 20 Monte Carlo runs with a random source-receiver
configuration.
And on the bottom here is the output from the theoretical case,
using the expression on the previous slide, and the simulated direct-to-reverberant ratio
from the output of the beamformer. It's plotted as a function of the
number of microphones and the number of sources. And what it shows is that,
unsurprisingly, as we increase the number of microphones we improve the
performance, because we increase the amount of spatial diversity of the receiver.
But also that, provided we have more than about two microphones, if we introduce
additional sources, the output DRR improves as well. So without any additional
hardware, this rake paradigm gives us an improved DRR almost for free, with the
exception of some additional computational overhead.
So the output of the rake delay-and-sum beamformer is shown here for a case of
six microphones and seven sources, where we see a very large tap caused by
the [inaudible] here and some of the direct-path and first-order reflection
sources; some anti-causal taps due to the way in which the signals are
aligned; and this reverberant distribution on the right-hand side. This is
really only a preliminary investigation, but it does show that there is some merit in
applying this rake delay-and-sum beamformer, at least in the simulated
environment. So that's all I would like to say on the rake
delay-and-sum beamformer.
And I would like to summarize by saying that reverberant environments can be
modeled as a distribution of image sources and that low-order reflections form
sparse impulses in the acoustic impulse response. And what we can then do is
perform a temporal alignment of the direct path and first order reflections to form
what we call an acoustic rake receiver. And this preliminary analysis reveals that
improved dereverberation can be achieved compared with a conventional
beamformer.
But there's quite a lot of future work to be done. It would be nice to make an
extension whereby we use optimal beamforming as opposed to straightforward
delay-and-sum. And it needs to be evaluated against alternative approaches
such as the maximum SNR beamformer, which in some sense considers all
reflections based on some estimate of the noise space.
So I'd now like to move on to the last topic of the talk, which we call geometric
inference. And this, like the rake receiver, applies some geometric modeling.
Now, this falls into the category of acoustic scene reconstruction, which aims to
estimate some parameters of the acoustic environment. Specifically, we
want to know where the reflecting boundaries are, based on some acoustic
measurements. And this has been shown to be useful for things such as source
localization and wavefield rendering. The way we do this is to use a geometric
approach based on the time of arrival of reflections.
So if we look at the diagram on the left, we can see a source r_s, some
receivers r_1 through r_3, and a single line reflector.
Now, I'll limit myself to the 2D case for this discussion and just talk briefly about
the extension to 3D at the end. But we can say that the time of arrival of a
reflection is the composite time from the source to the reflector and from the reflector
to the receiver. Looked at in the acoustic impulse response, the
time of arrival is the absolute time at which the peak occurs. And the time
difference of arrival is the difference between, for example, the direct-path
signal in channel one and the direct-path signal in channel two.
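Written out with some hypothetical notation, the two quantities are

\[
\tau_n^{\mathrm{(refl)}} = \frac{\|\mathbf{r}_s - \mathbf{p}_n\| + \|\mathbf{p}_n - \mathbf{r}_n\|}{c},
\qquad
\mathrm{TDOA}_{n,1} = \frac{\|\mathbf{r}_s - \mathbf{r}_n\| - \|\mathbf{r}_s - \mathbf{r}_1\|}{c},
\]

where p_n is the point on the reflector at which the path to receiver n is reflected and c is the speed of sound.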
Now, estimating these TOAs is actually a bit of a problem. On the top left
here is a supervised system identification of a real room, and this is a fairly
ideal case: we had a good loudspeaker, a good microphone, low noise. But
what we see is that rather than having a nicely defined, sinc-like peak as seen
on the previous slide, where we used the image method, we have lots of little peaks
of some compact support caused by the impulse response of the measurement
apparatus. And most of this was in the loudspeaker itself.
And this produces an ambiguity as to where this time occurs. The problem
gets even worse when we consider an acoustic impulse such as a click. So
here we've got a finger click. This has got a slightly wider support, but lots of
little peaks, and it becomes very ambiguous as to where these peaks actually
occur. So it then becomes very difficult to pinpoint an exact time.
So the way we approach this problem is to estimate the impulse response, on the
left for the supervised case and on the right with a click, and then identify the
direct-path signal. We then draw a window around the direct path, time-reverse it,
and then convolve it with the whole impulse response. This is a matched
filter approach.
And the ultimate outcome of this is not to reduce the support but instead to give a
nicely defined single peak that we can identify very well for the direct path and the
first-order reflections. The same approach can be applied to the finger click
case.
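As a minimal sketch of that matched-filtering step, in Python; the window length and peak threshold here are illustrative assumptions rather than the values we used:

```python
import numpy as np
from scipy.signal import fftconvolve, find_peaks

def sharpen_toas(h, direct_idx, half_win=64, rel_height=0.3):
    """Matched-filter an impulse response estimate to sharpen TOA peaks.

    h          : estimated (possibly smeared) acoustic impulse response
    direct_idx : sample index of the direct-path arrival
    half_win   : half-width of the window around the direct path (illustrative)
    rel_height : peak-picking threshold relative to the largest peak (illustrative)
    """
    # Window out the direct-path segment of the response.
    lo, hi = max(direct_idx - half_win, 0), direct_idx + half_win
    template = h[lo:hi]

    # Time-reverse the windowed direct path and convolve it with the whole
    # response: a matched filter that concentrates each smeared arrival
    # into a single well-defined peak.
    matched = fftconvolve(h, template[::-1], mode="same")

    # Pick peaks above a fraction of the largest (direct-path) peak.
    peaks, _ = find_peaks(np.abs(matched),
                          height=rel_height * np.abs(matched).max())
    return matched, peaks
```

The peak times of the sharpened response are then taken as the TOAs of the direct path and the early reflections.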
So that's the first problem out of the way. The second problem is that usually we
don't have exact synchronization between the stimulus and the recording, so we
don't know exactly when the stimulus occurred. All we have is the time
difference of arrival. But this is okay, because we can use existing source
localization algorithms based on the time difference of arrival that localize the
source relative to the array, and then, making some assumptions about the speed of
sound, we can estimate the time of arrival.
So let's say we then have this time of arrival. What can we do with it? Well, a
rather neat result occurs, in that if we know the spatial location of the receiver
relative to the source and we know the time of arrival, then this
parameterizes an ellipse, and we know therefore that somewhere tangential to
the ellipse is the location of the reflector. If we then consider multiple
observations where the array geometry is known, we can form multiple
ellipses and find the common tangent, and this common tangent corresponds to
the location of the reflector.
Now, the way we solve this analytically is to define the line in homogeneous
coordinates, and it's this line that corresponds to the reflector. It is
characterized by the fact that if you multiply this tuple with (x, y, 1), any
point on the line gives zero. We can also define a conic with this matrix,
and there are some constraints in defining an ellipse using this notation. And
it can be proven that a line is tangential to the conic provided that l transpose
C adjoint l equals zero, where the adjoint of C is denoted [inaudible] as C star.
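As a reminder of the standard projective-geometry statement behind this: a point x = (x, y, 1)^T lies on the line l = (l_1, l_2, l_3)^T when l^T x = 0, the ellipse is the set of points with x^T C x = 0 for a symmetric 3-by-3 matrix C, and the line is tangent to the conic if and only if

\[
\mathbf{l}^{\mathsf T}\, C^{*}\, \mathbf{l} = 0,
\]

where C* = adj(C) is the adjugate (adjoint) of C.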
So then, considering all of the channels, we can form a cost function based on the
squared magnitude of this l transpose C star l. Unfortunately, solving this
problem is a little bit tricky because it's a nonlinear least-squares optimization problem,
and if we apply iterative techniques they tend to become trapped in local
minima. Only recently have we come up with a closed-form solution, by
making some assumptions about the nature of the cost surface and taking slices
through that cost surface. I won't go into the detail now, because we've not
got much time left.
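A minimal numerical sketch of that cost function, assuming the conic matrices of the ellipses have already been constructed (the construction from the source, receivers, and TOAs is omitted, and the function names here are illustrative):

```python
import numpy as np

def adjugate_3x3(C):
    """Adjugate (classical adjoint) of a 3x3 matrix.

    For an invertible matrix, adj(C) = det(C) * inv(C); a non-degenerate
    ellipse always gives an invertible conic matrix.
    """
    return np.linalg.det(C) * np.linalg.inv(C)

def reflector_cost(l, conics):
    """Sum of squared tangency residuals |l^T C*_n l|^2 over all ellipses.

    l      : candidate reflector line in homogeneous coordinates (3-vector)
    conics : list of symmetric 3x3 conic matrices, one ellipse per channel
    """
    l = np.asarray(l, dtype=float)
    return sum(float(l @ adjugate_3x3(C) @ l) ** 2 for C in conics)
```

In practice the cost has to be normalized against the scale of l, since l and any nonzero multiple of it describe the same line.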
The case of multiple reflectors is an additional problem, because we've got all
these TOAs and they then need to be grouped somehow to find the individual
reflectors; we've got these TOAs and we don't know which reflector each one
comes from. We have some techniques that don't give a global optimization,
because that is too big a problem and is also non-convex. But that's all I'll say
about that now.
But let's say we have a line l as a ground truth and we have an estimate of the line,
l hat. We could evaluate it by looking at the alignment error, which is just the
normalized inner product between the line and its estimate, and this is
proportional to the cosine of the angle between them. Alternatively, we can just
find the angle between them directly.
This can be misleading, however, because any parallel lines in space will give a
low alignment error even if they're a long, long way from one another. So we
have another error that we call the distance error, where we take the zeroth
microphone, project it onto the line and onto the estimated line, and
look at the distance between the projections.
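A sketch of those two error measures, with each line stored in homogeneous coordinates (a, b, c) for the equation ax + by + c = 0; the exact normalization conventions here are my own assumptions:

```python
import numpy as np

def alignment_error(l, l_hat):
    """One minus the normalized inner product of the two homogeneous line
    vectors, so a perfectly estimated line gives an error of zero.
    (Normalization convention assumed, not taken from the talk.)"""
    l, l_hat = np.asarray(l, float), np.asarray(l_hat, float)
    return 1.0 - abs(l @ l_hat) / (np.linalg.norm(l) * np.linalg.norm(l_hat))

def distance_error(l, l_hat, mic0):
    """Project the zeroth microphone onto the true and estimated lines and
    return the distance between the two projected points."""
    def project(line, p):
        a, b, c = line
        n = np.array([a, b], dtype=float)
        # Orthogonal projection of point p onto the line ax + by + c = 0.
        return p - ((n @ p + c) / (n @ n)) * n
    p0 = np.asarray(mic0, dtype=float)
    return np.linalg.norm(project(l, p0) - project(l_hat, p0))
```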
So we can set up some experiments where we simulate a room with dimensions of
between three and five meters by between four and six meters using the image method. We
placed four microphones and one source in random locations. And for each
Monte Carlo run, we formed five seconds of observations using a Gaussian
noise stimulus.
Noise was then added to the observations to generate an SNR between minus
5 and plus 40 dB. So the first thing we did was take these
observations and try to estimate the channel blindly using the robust NMCFLMS
algorithm, which is a frequency-domain blind system identification algorithm.
The time differences of arrival were then found by a group-delay-based peak
detector, in much the same way as we used the group delay function to work out
the location of GCIs in the first dereverberation algorithm.
The source was then located relative to the array using these TDOAs, and then
the proposed algorithm was applied to all four walls over 200 runs.
And the results are as follows. On the left is an example room where we have
the location of the source and four receivers placed randomly in the room. We
can just about make out the common tangent of the blue ellipses forming the blue
line, and the common tangent of the red ellipses forming the red line. So this is an
example of where the algorithm has worked well.
And then on the right here is a plot of the localization accuracy as a function of
input SNR. By ranking the estimated locations we can see that the best
wall, even in the case of minus 5 dB, still does pretty well. And provided we have
an SNR of greater than about 5 dB, we can estimate even the worst wall.
And this is because the BSI algorithm is able to identify the sparse taps very quickly.
The amplitudes of the sparse taps may not be identified very well, and the
amplitude and location of the tail may not be very good. But provided we can
pick out the time of arrival from the sparse taps, this is enough
to perform the inference.
What we can also see in this dashed line is the source error, the localization error of
the source in meters, and provided the SNR is greater than about 0 dB,
it does okay.
>>: [inaudible] impulses.
>> Mark Thomas: Well, the impulse is over all frequency and up [inaudible].
>>: [inaudible] wide band signal?
>> Mark Thomas: It's a wide band signal.
>>: Okay.
>> Mark Thomas: So it's wideband Gaussian noise that it was stimulated with.
>>: So that's interesting.
>> Mark Thomas: The blind system identification algorithm that we used, the
robust NMCFLMS, has been applied to speech. Now, although we haven't
applied it here, a speech signal can be used to identify the transfer function, the
impulse response, up to a degree of error. And given that the location in time
of the sparse low-order reflections can be found fairly quickly, it may be possible
to perform inference on speech alone. There will inevitably be an
increased error in the source localization, so that would need to be
looked at as well.
And we actually have a paper in the upcoming WASPAA conference that looks at
what happens when we have errors in the source localization and how they
propagate through to errors in the location of the reflectors.
So that was a simulated case. We also took a real-world recording of a room of about
five by six meters with a high ceiling and placed a mini microphone array in the
corner, made up of four microphones as shown here; I think they were placed at
16-centimeter intervals. We placed an additional microphone just slightly off
the line of the array, and the reason for this is to resolve the front-back ambiguity
that's inherent in linear array processing.
We placed a loudspeaker at one of four different locations and performed a
supervised system identification using the maximum length sequence method.
The ground truth is shown with a circle and the estimate with an X. And I think
the error is actually due in part to estimating where the speaker was, because if you've
got a bit of tape, there's only a finite region within the loudspeaker and
it's very difficult to know exactly where the equivalent point source is,
because this is not a point source.
But nevertheless the algorithm was applied, and in green and
blue are the estimated walls; the true walls pass through the origin, parallel to
the axes. And this seems to have done its job quite well. We
can see for one of the measurements, shown in blue, the ellipses, and
there's no ambiguity here because the array is pretty
much normal to the wall. But in the case where the wall is parallel to the array,
this front-back ambiguity comes in, and only the ellipse due to this fifth
microphone in the middle allows us to disambiguate between the front and the
back. But here it seems to have done its job.
So I'd like to summarize this section by saying that localization of 2D reflectors
can be achieved with a geometric approach, and that it can be applied in
very low SNR environments and to real recordings. This is an ongoing
piece of work in the project that I'm working on at the moment. There is the
question of robustness enhancement using the Hough transform, and we have a
paper at the upcoming EUSIPCO conference on this.
As I mentioned earlier, there's the error propagation analysis due to things like
errors in localizing the source and temporal errors caused by finite frequency and
[inaudible] and so forth.
There's also the problem of the extension to 3D, where an ellipse is no longer
the solution; instead the solution lies somewhere on an ellipsoid, and this makes it
computationally much more demanding. But the theory actually carries over quite
neatly from 2D to 3D.
And there's also the application to wavefield rendering. Now, I've not been involved in
this myself, but within our project we've used knowledge of the array of sources
in wavefield rendering relative to a reflector, and then used the image sources as
part of the wavefield rendering. And this has been achieved in a real-world
scenario.
So I would now like to bring my talk to a conclusion. I'd like to summarize by
saying that there are many algorithms that use multichannel observations without
relying exclusively on beamforming techniques, and that some multichannel
algorithms can outperform single-channel algorithms applied to the output of a
beamformer. But beamforming remains an extremely important component of
the algorithms discussed, so we would like to think of these algorithms as
complementary to, rather than a replacement for, beamforming.
The dereverberation algorithms that we've discussed exploit multiple channels by
appealing not only to wave propagation modeling but also to voice modeling, to
channel modeling, and to geometric modeling. And finally, the estimation of
room geometry from multichannel observations can be achieved for practical
environments, and this is useful in many areas of acoustic signal processing.
That includes my talk. I thank you for your attention. And I welcome any further
questions.
>> Ivan Tashev: Thank you, Mark.
[applause].
>>: [inaudible] one, maybe two questions. There were a lot of questions during
your talk. So questions? Thank you again.
>> Mark Thomas: No further questions. Thank you very much.
[applause]