>> Ivan Tashev: Good morning, everyone. For me it's a great pleasure to introduce the speaker of the second talk of our mini seminar today, Professor Walter Kellermann. I don't think I have to give a long introduction here. He is well known as a valuable contributor to signal processing.
We know that he is currently chairman of the Technical Committee on Audio and Electroacoustics of the IEEE Signal Processing Society, and he is an IEEE Distinguished Lecturer. But for many of us, of course, he is simply Walter Kellermann, who constantly brings new algorithms and new ideas, improves this or that particular algorithm in acoustic echo cancelation, noise suppression, and microphone arrays, and is a constant participant of the conferences he loves to take part in.
So, without wasting too much time, Professor Kellermann has the floor.
>> Walter Kellermann: Thank you, Ivan, for this very nice introduction. As all of you know, the main work is usually done by the students, and this is the list of students I want to recognize first before I start my talk. Some of you will know Robert Aichner, and he promised to join us today. He has been with Microsoft for about a year. We are working on acoustic signal processing, which is the first line of the title. The next one is of course a buzzword. But the main thing is of course the multi-channel aspect of what we want to look at in this talk.
I would like to start with a brief introduction, and I apologize already that this talk does not go too much into details. I would like to focus on the general scope of this area and also, of course, on some specific solutions, in order to make you believe that at least something works.
So the goal is pretty simple: an ideal, seamless human-machine interface, which means the users here in an acoustic environment should be untethered and mobile, and we foresee a number of loudspeakers and a number of microphones to do some signal processing. The signal processing should fulfill the following tasks: first, it should reproduce signals for the listeners at their ears in the way they like it, or the way we would like them to listen to it. The other tasks are basically to capture source signals and to determine the source positions. So you see already we have a reproduction problem and we have a signal acquisition problem, and the fact that it is a problem at all is due to the acoustic environment here, which creates feedback between the loudspeakers and the microphones. We have reverberation, and reverberation comes in two flavors here. Usually -- and John McDonald alluded to that already -- we have the reverberation when picking up the microphone signals, and the second flavor is the reverberation that we hear when we are listening to loudspeaker signals. Basically this is something which we mostly ignore, but I will come to that later; it is a perceptual thing that needs to be addressed in a reproduction system, especially if it is high quality. And finally we have, of course, noise and interference, and we want to address all three of these problems in the following. One of the real problems is that in many scenarios you have all of these problems together, simultaneously, so the signal processing needs to look at them simultaneously.
You all know many of these applications. I categorize them into two classes. One is basically hands-free equipment for telecommunication and what I call here natural human-machine interaction, which of course includes speech recognition. You find these in mobile phones, in cars, in computers, but also in telepresence systems, and this is an area which I especially like because you may have more money to spend on microphone and loudspeaker arrays, which is not the case, for example, in cars and mobile phones. And then of course there are all these smart applications like smart meeting rooms and smart homes, and sometimes you look at museums and exhibitions as well.
The second class is audio communication, and this means here we look at equipment for stages and for recording studios, and also virtual acoustic environments, for example virtual concert halls or tele-teaching studios. With regard to the tele-teaching studios, you may think of musicians practicing together in two different places. This is a very nice problem; the requirements there are pretty tough, and that's something to look at in the future which is very challenging.
And of course there's a more secretive application, which is surveillance, and as you can see, I don't put too many subitems here, for the same reason. Okay. Let me show you a few microphone arrays which we built. Actually, when I joined the university in Erlangen we started with a project on microphone arrays for laptops, with Intel at that time, and here you see a microphone array of eight mics which will be used for beamforming in the following and to illustrate the beamforming algorithms in the sequel as well. Then Robert Aichner worked in a European project where the environment got a little more challenging, namely hands-free communication for emergency cars like fire brigade cars, where there's a lot of noise and a lot of stress on the speakers. Obviously this is only a reference microphone, so it is not to be used in practical applications; the actual microphones are here.
We use microphone arrays also for interactive TV applications now, in a recent European project entitled DICIT, and what you see here is a large array with logarithmically spaced sensors; the whole array is about 1.5 meters long. So that's for large-screen TVs. And here you see another application for localization, actually in a parking garage. I hope the image is not too dark for you, so that you can see the array here and here. This is meant for localization purposes in highly reverberant environments like parking garages.
Another application of surveillance would be public spaces, and here you see another microphone array with a total of six sensors, and here we have a very small one which is also used for multiple source localization. It's only about ten centimeters in diameter.
>>: [inaudible]. That public space, how [inaudible].
>> Walter Kellermann: Basically you can use it for surveillance purposes, so that you can immediately steer a camera to where somebody smashes a window, for example. It's mostly for bad guys, right. And of course I only show the civil applications here. Okay.
Then I always like to show a picture of a microphone array which I did not build. That's the one by Gary Elko here. And I always use this in order to illustrate what also could be done but has been forgotten over the past years.
This is an array which is basically meant for communication between several auditoria, and this was at the high times of [inaudible], when they built this 360-element array for picking up the sound in the auditorium and being able to broadcast questions to the other auditorium as well. So you did not need a person to run around and hold a microphone in front of the people asking questions; you could use this microphone array. That was actually back in the '80s, and during my Ph.D. time, more or less exactly 20 years ago, we digitized it by putting a DSP right behind each sensor. It was one of the first-generation floating-point DSPs.
Some of you may know on the reproduction side -- yeah?
>>: [inaudible] array [inaudible].
>> Walter Kellermann: Originally it was actually full audio bandwidth, but we only used telephone bandwidth for evaluation. So it's eight kilohertz.
>>: [inaudible].
>> Walter Kellermann: In that, yeah.
>>: [inaudible].
>> Walter Kellermann: Oh, that was a lot of data actually. Yeah. We had cubic meters
of hardware. You may ask Gary for the details.
Okay. Then on the reproduction side, we have been looking into wave field synthesis for a few years now, and actually this work is mostly done by Rudolf Rabenstein, who is also with our chair. He is basically trying to recreate acoustic wave fields not just in a sweet spot but in an extended area, which for example would mean here in this area enclosed by this rectangle. But you can also have these loudspeaker arrays in that fashion. Basically you use these panels, and each of these panels is essentially an eight-channel loudspeaker, so you have eight exciters behind each panel. This was supported by the European Union within the CARROUSO project a while ago.
This was just for introduction. Now let me move over to the more generic stuff in terms of equations. I would like to look at the fundamental problems that we face here for reproduction and acquisition, and then I would like to show some advances in our group, mainly in the area of signal acquisition, and that looks at multi-channel acoustic echo cancelation and beamforming -- I have to apologize for using beamforming. But we also look at blind signal processing, and then you may think of all your higher-order statistics. We published a while ago a concept which we think is quite generic and which we entitled TRINICON; it can be used for source separation, de-reverberation, and source localization, and actually you can see it as a more general theory incorporating supervised filtering -- like adaptive filtering for echo cancelation and beamforming -- as well.
And finally I will briefly look at source localization in the wave domain, and then I will come back to one of the examples of microphone arrays that you have seen already. So let me first start with the way we describe these problems. Basically, what we assume is that we want to do linear signal processing involving the two kinds of signals that are of interest to us: we have a set of reproduction channels, U, and we have a set of signals, Z, that we want to extract. If you think of the inputs and outputs of this digital signal processing unit here, you have the loudspeaker signals as one set of outputs and the retrieved signals as the other set of outputs, so we put those two into the output vector. As this should all be linear signal processing, we assume that the set of reproduction signals U and the set of microphone signals X are processed as inputs, and this G captures all the linear signal processing, which is of course multiple-input, multiple-output. Then you can look at this G in terms of submatrices connecting these inputs -- the reproduction channels and the microphone signals -- to the outputs individually, and so you have these four submatrices, which we will use in the following to describe the problem.
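As a rough illustration of this block structure (my own sketch added for clarity, not from the talk: the names U, X, V, Z and the four submatrices follow the notation above, and a single frequency bin with memoryless matrices stands in for the full convolutive MIMO system):

```python
import numpy as np

# One frequency bin, purely illustrative dimensions.
rng = np.random.default_rng(0)
M_rep, N_mic, K_ls, Q_out = 2, 4, 8, 2  # reproduction ch., mics, loudspeakers, extracted signals

G_VU = rng.standard_normal((K_ls, M_rep))   # reproduction channels -> loudspeakers
G_VX = rng.standard_normal((K_ls, N_mic))   # microphones -> loudspeakers (interference compensation)
G_ZU = rng.standard_normal((Q_out, M_rep))  # reproduction channels -> extracted signals (echo cancelation)
G_ZX = rng.standard_normal((Q_out, N_mic))  # microphones -> extracted signals (separation, noise reduction)

# Full linear processing matrix G maps the stacked inputs [u; x] to the outputs [v; z].
G = np.block([[G_VU, G_VX],
              [G_ZU, G_ZX]])
u = rng.standard_normal(M_rep)   # reproduction channel signals
x = rng.standard_normal(N_mic)   # microphone signals
out = G @ np.concatenate([u, x])
v, z = out[:K_ls], out[K_ls:]
print(v.shape, z.shape)          # (8,) loudspeaker signals, (2,) extracted signals
```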
Then the goal is essentially to provide the desired listener signals at the ears here, and if you describe this in terms of linear systems theory, you see you have the convolution of the loudspeaker signals with the acoustic environment -- this [inaudible] system described by this matrix here -- and you have the additional noise. This W of course captures two signals per listener in our case. On the other hand, we have the microphone signals X, which are basically a mixture of the source signals convolved with the acoustic environment, the impulse responses again, plus the acoustic echo resulting from the loudspeakers transmitted via the acoustic environment, plus the noise vector n_X -- that's for the noise. One of the reasons why this is a problem at all is that the acoustics can be quite nasty. All the elements in our matrices H here are basically impulse responses, and usually you characterize them by the reverberation time. This is the time the sound needs to decay by 60 dB, referred to as T60, and this is approximately 50 milliseconds in cars, while in concert halls it reaches up to one or two seconds.
Whenever you want to model these impulse responses, you usually take FIR filters with a number of coefficients that is typically T60 times the sampling frequency divided by 3; that captures enough energy so that the model error stays roughly 20 dB below the signal energy.
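As a small worked example of that rule of thumb (my own illustration, not from the talk's slides; the room names and T60 values are just typical figures):

```python
# FIR length needed to model a room impulse response down to roughly -20 dB
# model error, using the rule of thumb N ~ T60 * fs / 3 mentioned above.
def rir_model_length(t60_s, fs_hz):
    return int(round(t60_s * fs_hz / 3))

fs = 12000
for name, t60 in [("car", 0.05), ("office", 0.3), ("living room", 0.5), ("concert hall", 1.5)]:
    print(name, rir_model_length(t60, fs), "taps at 12 kHz")
# e.g. office: 0.3 * 12000 / 3 = 1200 taps -- consistent with the
# "1,000 or more coefficients" quoted later for a conference room.
```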
These models are inherently non-minimum phase, and if you look at typical impulse responses here and at their pole-zero diagrams, you see most of the zeros are actually on the unit circle. In this case it's an impulse response -- we call it 'inner office' -- with a T60 of about 300 milliseconds at a sampling rate of 12 kilohertz; this is the impulse response, and you have actually thousands of zeros on or very close to the unit circle. That tells you something about how to deal with this if you want, for example, to de-reverberate or to equalize it.
>>: [inaudible].
>> Walter Kellermann: Yes?
>>: [inaudible].
>> Walter Kellermann: You have a flat delay from the source to the microphone. That is
not minimum phase.
>>: So it's not minimum phase?
>> Walter Kellermann: Non minimum. Right. It's not -- it's non minimum phase. Any flat
delay is non minimum phase.
Okay. So if we look now at the problems that we have to solve, given these acoustic environments, then in reproduction we of course want to look at the desired signals at the ears. Basically what we want to do is create a set of desired signals, w_D here, which is related only to the reproduction channels by some matrix H_D, and this is basically the desired transfer matrix that we have to provide in the end. If we equate this to the actual path from the reproduction channels to the ears, then we essentially have to equate this H_D to this term here, and you see that in order to provide this you have to deal with the noise: you still have the noise at the ears, and as a handle on that, you can use the microphone signals. And as a second component you see that the path from U must equal the desired path, and in reality this is the acoustic path together with the matrix with which you have to condition the signals to be reproduced, this G_VU.
So we have essentially two subproblems. One is an equalization problem, which you may look at as a deconvolution problem: this unit must equalize the acoustic path in order to achieve the desired characteristic. The second problem is an interference compensation problem, which basically means you have to use the microphone signals and create an estimate of the noise at the ears that can be fed into the loudspeakers, and of course this also has to account for the acoustic path.
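As a rough numerical illustration of the equalization subproblem only (my own sketch, not from the talk: the names H_ears, H_des, G_VU follow the notation above, it works in a single frequency bin, and it ignores the interference-compensation term and all blind-identification issues):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 8, 2                      # loudspeakers, reproduction channels
# H_ears: 2 x K transfer from loudspeakers to the two ears (one frequency bin).
H_ears = rng.standard_normal((2, K)) + 1j * rng.standard_normal((2, K))
H_des = np.eye(2, M)             # desired 2 x M path, e.g. left channel to left ear

# Regularized least-squares choice of G_VU so that H_ears @ G_VU ~ H_des.
lam = 1e-3
G_VU = np.linalg.solve(H_ears.conj().T @ H_ears + lam * np.eye(K),
                       H_ears.conj().T @ H_des)
print(np.abs(H_ears @ G_VU - H_des).max())   # small residual in this bin
```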
Obviously this is not so simple, because you have to blindly identify the acoustics here between the loudspeakers and the microphones, and you also have to identify the noise component at the ears, which is also non-trivial. This is a real challenge, and all the active noise cancelation approaches we know so far always assume that they can measure the noise close to the ears. But in practice, in our desired environment, this is not possible, because we want to be seamless and people should not need to wear any devices close to their ears.
Okay. Let me briefly look at the state of the art in reproduction, just in terms of this description. If you look at conventional stereo and multi-channel systems, there is nothing like interference compensation, and equalization does not consider the real acoustics either; it only considers something which you may like as a listener -- you boost the bass or change the amplification in various frequency bands, so you have frequency-selective gains, but no accounting for the acoustics.
If you look at beamforming with loudspeakers, that can be done either in a more conventional way or with super-directive beamforming, also ultrasound-based, but still there is no interference compensation so far, and also there is often only a coarse room equalization for a sweet spot, but not for a whole extended area where people might want to move while listening.
What can be done, and is actually done in some products already, is to use impulse responses for the desired characteristic, so there is a rough equalization already, but there is no inversion or equalization of the room characteristics directly.
The furthest we get so far is wave field synthesis, and there are first attempts at interference compensation. Actually, Achim Kuntz started to work on this in 2004, but most of the time, in practical environments, it is still not possible to follow, for example, the time variance of the acoustic environment. On the other hand, there are some results on equalization, but also only for time-invariant room acoustics, and Sascha Spors has been working on that for about four years now. But most of the time, even in wave field synthesis, this is just a set of impulse responses that are used for the reproduction channels in order to mimic the sound of artificial rooms, without accounting for the real listening environment.
So you can listen to sounds as in churches, but you don't really invert the local listening acoustics. Okay. There are quite a few challenges involved with that. One is the equalization and the other is the interference compensation, and both, if you want to do them perfectly, require blind identification of these acoustics. And this is something which is still pretty far out of reach, I would say. But it's of course a nice challenge for academia. One problem, however, is that many people argue we may not actually need that. We don't need that much spatial realism; we may not really need to solve the whole identification problem, which is highly sensitive -- and we have published papers on how sensitive it is. So it's really questionable how much spatial realism we can really appreciate. This is one of the questions that is still open in this area, and it may be that this is also one of the reasons why not so much effort is devoted to this.
>>: [inaudible]. Even if you can identify the system perfectly, though, if it's non-minimum phase you can't invert it, right?
>> Walter Kellermann: Right. You have to have an estimate for that. But as listeners are usually not really sensitive to all-pass filtering, and you know any filter can be decomposed into an all-pass and a minimum-phase system, it would be sufficient to equalize the minimum-phase part. So you cannot perfectly identify that in many cases, right? But if you have a good guess for the all-pass part, that might be sufficient.
Okay. Let me turn to the signal acquisition part, and this is where we are quite sure that it's reasonable to identify the systems we need to identify. Here the problem is that we essentially want to extract a set of desired signals out of this acoustic scene, and they should be undistorted in the sense that they should sound as if you had put a microphone very close to the mouth here and removed all other effects. If you compare this to what you really get, you of course have these microphone signals which, as I said, are a mixture of the source components, the acoustic echo, and the noise, and this can now be treated by using your other inputs, namely the reproduction channels. So you have these two submatrices, G_ZU and G_ZX, one acting on the reproduction channels and one on the microphone signals, in order to get what you actually want. This results in essentially three subproblems. One is the acoustic echo cancelation problem, where you essentially need to model this acoustic path, including those matrices here, by this matrix; once you equate those two paths, the acoustic echo is gone, which is basically what you want: you want to remove all components that are related to your reproduction channels. The second problem is the source separation or de-reverberation problem, which essentially treats the components resulting from the desired sources. What you get at the microphones is the mixture of all the desired sources, each convolved with the acoustic path impulse response. You can treat that with the matrix G_ZX, and the result should basically be a flat delay of the desired sources. And finally you would like to suppress all the interfering noises, which essentially means that this matrix should nullify all the noise components. This looks very simple, but of course G_ZX also has to fulfill this condition and this one, so the problem is that this G_ZX needs to satisfy all of these requirements at the same time. And one of the problems that you don't see so easily is that, first of all, you would have to separate all these various components in the microphone signals, and this is also to be done by G_ZX.
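A toy single-frequency-bin illustration of these conditions (my own sketch, not the talk's algorithm: it assumes the echo paths H_e and source paths H_s were known, which in reality they are not, and it ignores noise suppression):

```python
import numpy as np

# Mic vector: x = H_s @ s + H_e @ u + n ; extracted signals: z = G_ZU @ u + G_ZX @ x.
rng = np.random.default_rng(1)
P, K, N = 2, 2, 4                     # sources, loudspeakers, microphones
H_s = rng.standard_normal((N, P))     # source -> mic paths (one bin, real-valued toy)
H_e = rng.standard_normal((N, K))     # loudspeaker -> mic echo paths

G_ZX = np.linalg.pinv(H_s)            # here: also (ideally) separates the sources
G_ZU = -G_ZX @ H_e                    # echo-cancelation condition: removes everything driven by u

s = rng.standard_normal(P); u = rng.standard_normal(K); n = 1e-6 * rng.standard_normal(N)
x = H_s @ s + H_e @ u + n
z = G_ZU @ u + G_ZX @ x
print(np.allclose(z, s, atol=1e-4))   # echo removed, sources recovered (noise-free limit)
```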
Okay. Now, after talking so much about problems, I would like to come to a few solutions. One is multi-channel acoustic echo cancelation, and here we start with the isolated single-channel case in order to define the problem again. Essentially this has been a textbook problem for 25 years; it has been recognized as such, and what you have to do is find a filter which mimics the impulse response from the loudspeaker, through the acoustics, to the microphone, in order to remove the components of U. As this acoustic environment is time-variant, you do this, as you know, with an adaptive filter which usually approximates the Wiener filter solution. Sorry for using second-order statistics here, but this is quite appropriate and has been working for many years; actually, we also use some higher-order statistics in robust filtering as well.
But for now it's sufficient to assume we have adaptive filters that approximate the Wiener solution. And why is this a problem at all? Why do people talk about it -- and I've seen a talk by Microsoft people at IWAENC on the problems of implementing this? It's a long filter: typically, to get 20 dB echo suppression in a conference room at a 12 kilohertz sampling rate still asks for 1,000 or more coefficients. In a living room it's easily 4,000 coefficients you would like to realize, simply in order to mimic or model this acoustic path.
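As a minimal sketch of the kind of adaptive filter meant here (my own illustration using plain time-domain NLMS on a synthetic decaying "room"; the systems in the talk use frequency-domain and robust variants instead):

```python
import numpy as np

rng = np.random.default_rng(2)
fs, L = 12000, 1000                  # sampling rate, modeled filter length
h = rng.standard_normal(L) * np.exp(-np.arange(L) / 200.0)   # toy decaying room response
u = rng.standard_normal(5 * fs)      # loudspeaker (far-end) signal
d = np.convolve(u, h)[: len(u)] + 1e-3 * rng.standard_normal(len(u))  # microphone signal

w = np.zeros(L)                      # adaptive estimate of the echo path
mu, eps = 0.5, 1e-6
for n in range(L, len(u)):
    x = u[n - L + 1 : n + 1][::-1]   # current state vector of the FIR filter
    e = d[n] - w @ x                 # echo-canceled (error) signal
    w += mu * e * x / (x @ x + eps)  # NLMS update
print("misalignment [dB]:", 20 * np.log10(np.linalg.norm(w - h) / np.linalg.norm(h)))
```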
The multi-channel case is a little more complex, for two reasons. The first is that, obviously, if you have K loudspeakers, you have K times more filter coefficients to adapt, which makes the problem computationally more expensive. But the real problem for academia is of course that the correlation matrix here, which is decisive for the adaptation speed and the adaptation behavior of the adaptive filters, is worse conditioned than in the single-channel case. I tried to illustrate this here by this matrix: in the single-channel case you see a diagonally dominant matrix, and this matrix is the autocorrelation matrix of the state variables of the FIR filter.
It is usually at least diagonally dominant, because speech also has a decaying correlation over time, so it would look like this. If you have, say, three channels, these channels are usually mutually correlated -- as you can easily imagine for stereo or 5.1, the individual channels are strongly correlated. Then the autocorrelation matrix of the set of all input vectors for all these filters looks more or less like this, and if you just [inaudible], there's a lot of regularity, and you find columns and rows which are very similar, because the diagonals in the off-diagonal blocks are still similar to the main diagonal, so this matrix is usually pretty ill-conditioned. Regarding the cross-correlation between the channels, there are a couple of solutions to this problem, but basically only bad solutions. Whatever you do, it somehow affects your loudspeaker signals, because correlation is something you cannot remove by linear, time-invariant processing. So you have choices: either you add some non-linearity here, and of course this must be inaudible -- and unfortunately, the more inaudible it is, the less it helps, right?
But that's quite natural. Or you can add noise, along the same lines: you should keep it below the masking threshold of the ear, so that people buying very expensive audio equipment, a $20,000 loudspeaker setup, do not hear any noise, right? But it still must be strong enough to de-correlate the signals, which is a dilemma, right?
The third option, and we think this is the best solution, is to introduce time-varying all-pass filtering, tuned such that the perceived sources don't move in space. So you have different all-passes here, they are time-varying, and actually they slightly move the sources: in a stereo reproduction you would think that if you change the phases of the signals the sources start to move. They actually do, but only to an extent that you cannot perceive -- your binaural hearing does not resolve it that finely. This was actually done together with the audio coding people, so you see Uden Harris [phonetic] named here, who was an intern with us at that time, and we developed this method together and published it at ICASSP 2007 or 2008. This works actually quite well. So this is one way of playing dirty tricks to resolve this correlation problem.
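A minimal sketch of the time-varying all-pass idea (my own toy version, not the published design: a first-order all-pass per channel whose coefficient is slowly and independently modulated, so the phase of each channel drifts a little while the magnitude spectrum is untouched):

```python
import numpy as np

def tv_allpass(x, a_center=0.0, a_dev=0.05, mod_hz=0.5, fs=16000):
    """First-order all-pass y[n] = a[n]*x[n] + x[n-1] - a[n]*y[n-1]
    with a slowly time-varying coefficient a[n]."""
    n = np.arange(len(x))
    a = a_center + a_dev * np.sin(2 * np.pi * mod_hz * n / fs)
    y = np.zeros_like(x)
    x_prev = y_prev = 0.0
    for i in range(len(x)):
        y[i] = a[i] * x[i] + x_prev - a[i] * y_prev
        x_prev, y_prev = x[i], y[i]
    return y

# Differently modulated all-passes on the two stereo channels reduce their
# mutual coherence; the modulation depth must stay perceptually inaudible.
rng = np.random.default_rng(3)
left = rng.standard_normal(16000)
right = left.copy()                        # fully correlated stereo toy signal
left_d = tv_allpass(left, mod_hz=0.7)
right_d = tv_allpass(right, mod_hz=1.1)
print(np.corrcoef(left_d, right_d)[0, 1])  # slightly below 1: no longer perfectly coherent
```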
And here is a solution, or at least one solution, to the complexity issue: you essentially go to the DFT domain, and in this case that results in the fact that, rather than inverting one huge matrix, you only invert many matrices of small size. Think of five channels and 4,000 coefficients in each channel: instead of inverting a 20,000 by 20,000 matrix, you rather invert 4,000 matrices of size 5 by 5. And this is essentially what you do in real time.
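A rough sketch of that DFT-domain argument (my own toy batch least-squares version, not the adaptive GFDAF algorithm: per bin you accumulate a small K x K loudspeaker cross-power matrix and solve it, instead of one huge time-domain system):

```python
import numpy as np

rng = np.random.default_rng(4)
K, B, T = 5, 256, 400            # channels, DFT bins, number of signal blocks
# Toy per-bin echo paths from K loudspeakers to one microphone.
H = rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))

S_uu = np.zeros((B, K, K), dtype=complex)   # K x K cross-power matrix per bin
s_ud = np.zeros((B, K), dtype=complex)      # cross-spectrum with the mic signal
for _ in range(T):
    U = rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))  # loudspeaker spectra
    D = np.einsum("bk,bk->b", H, U)                                     # mic spectrum = echo sum
    S_uu += np.einsum("bk,bl->bkl", U.conj(), U)
    s_ud += U.conj() * D[:, None]

# Per bin: solve a small K x K system instead of one (K*L) x (K*L) time-domain system.
W = np.linalg.solve(S_uu + 1e-6 * np.eye(K), s_ud[..., None])[..., 0]
print(np.abs(W - H).max())        # per-bin echo path estimates match the true paths
```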
So if you look at the results, you see that you can obtain reasonable convergence, much better of course than with LMS. Here is a convergence curve for the system error norm, which tells you how well we identify the real impulse responses, and you get reasonable convergence with this generalized frequency-domain adaptive filtering which Herbert Buchner and others proposed. This is here for two to five channels; over time you see the convergence is pretty slow, because this axis is in seconds. What you really hear, though, is the echo return loss enhancement, and this is the suppression of the actual echo, which converges much faster, to a level of about 20 dB in two seconds.
Now, needless to say, we claim that the reproduction quality with our time-varying filters is still much better than with the others, and we actually verified that by [inaudible] tests, but I would like to play an example here. I hope the audio works now.
So this is a microphone signal, and it was actually recorded for a distant speech recognition task. Some of you who know German will identify that.
[tape played].
>> Walter Kellermann: This is microphone signal.
[tape played].
>>: [inaudible].
>> Walter Kellermann: That's not you?
>>: No, it's not me.
>> Walter Kellermann: Pardon? No, no, it's Herbert Buchner.
[tape played].
>> Walter Kellermann: Sorry. Okay. That's basically the effect; I hope you could hear the convergence in the beginning, and you easily get down to a substantial amount of echo suppression -- echo cancelation. There's no post filter involved; for those of you who know post filtering, you can do even better, but for this application it's still a good [inaudible]. Pardon?
>>: [inaudible].
>> Walter Kellermann: There's no suppression. It's just the filter -- just the five-channel filter. So you could even do better, of course, but you would not hear that in this application.
>>: [inaudible].
>> Walter Kellermann: Sorry?
>>: [inaudible].
>> Walter Kellermann: [inaudible]. This is simply a perceptual test where you take, in this case, about a dozen people to listen, and you have hidden anchors, so you have ideal reference signals, and then they all rank relative to that. Okay. There is a real-time implementation running on a PC, and this is actually quite old already; it was about five years ago when we first implemented that on a PC. Nowadays PCs are much faster, but even at that time we could already realize 25,000 coefficients.
So what we are doing right now is a little more challenging: it is going from five channels to more, and in this case you see a loudspeaker array of 48 channels for wave field synthesis. Herbert Buchner actually published a generic concept for wave-domain adaptive filtering in 2004, which essentially transforms the problem into another domain, namely the wave domain. The idea is that you do not identify the individual channels from each loudspeaker to each microphone, but you decompose the wave field which is generated by the wave field synthesis into harmonics -- in this case cylindrical harmonics -- so in the end these are essentially spatial basis functions.
This transform comes first, and then you go to the DFT domain, and basically, in a setup with 48 loudspeakers and 48 microphones, rather than identifying 48 squared adaptive filters for all the elements in this acoustic matrix, you only need to identify K -- in this case 48 -- acoustic paths corresponding to the wave field decomposition in terms of the cylindrical harmonics.
So essentially you bring the computational complexity down from K squared to K -- actually a little more than K -- and that works quite well. We had papers at an early stage with simulated data, but we are now about to implement this in real-time hardware as well.
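A minimal sketch of the wave-field decomposition idea for a circular array (my own simplification, not the talk's transform: for a uniform circular array, the circular/cylindrical harmonic components at one frequency can be approximated by a spatial DFT over the sensor angles; the radial terms of the full wave-domain transform are omitted):

```python
import numpy as np

def circular_harmonics(p_mics, max_order):
    """p_mics: complex sound pressure at M equally spaced mics on a circle
    (one frequency bin). Returns harmonic coefficients for orders
    -max_order..max_order via a spatial DFT over the mic angles."""
    M = len(p_mics)
    phi = 2 * np.pi * np.arange(M) / M
    orders = np.arange(-max_order, max_order + 1)
    # coefficient of order m ~ (1/M) * sum_k p(phi_k) * exp(-j*m*phi_k)
    return (np.exp(-1j * np.outer(orders, phi)) @ p_mics) / M, orders

# Toy check: a field that is exactly one harmonic maps to a single coefficient.
M, m_true = 48, 3
phi = 2 * np.pi * np.arange(M) / M
p = np.exp(1j * m_true * phi)
coeffs, orders = circular_harmonics(p, max_order=5)
print(orders[np.argmax(np.abs(coeffs))])   # -> 3
```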
Nevertheless, this is an ideal environment in the sense that the geometry of the loudspeaker array is quite simple and nice to handle, because then the cylindrical harmonics tell you how to implement this transform. The real challenge is to go to, say, listening spaces like your living room at home, where you don't want a cylindrical or circular array -- you want to build the loudspeakers into the walls -- and then you need other transforms. And that makes it really hard. So that's still unsolved, but it's something to look forward to.
Then the second topic -- and I hope I don't run out of time -- is beamforming, which we use for signal extraction and interference suppression, and I have to say we are not explicitly targeting speech recognition applications here, but we did use it for speech recognition as well. Basically, if you look at it, it's simply a signal separation task: you have this box here and you would like to separate those signals, and that means you can use three domains -- time, frequency, or space. Of course, space requires more than one sensor here.
If the origin of the signal is known, be it in time, frequency, or space, then the separability is only determined by the aperture width, basically the time-frequency resolution or the spatial resolution; that's what determines how well you can separate the signals. If the origin of the signal is unknown in these three domains, then you can talk about blind signal separation. So, reducing this to the spatial dimension, you have supervised beamforming for the case that the spatial origin is known, and for the case that the spatial origin is unknown you have blind beamforming, or blind source separation. Actually both terms have been used; blind source separation sounds more fancy, and then most people forget that it's essentially blind beamforming.
Let me start briefly with supervised beamforming. What we see here is just an example of a signal-independent filter-and-sum beamformer in the conventional way, which should just be understood as an example of what you want to achieve. It looks at zero degrees, so that's the steering angle, and it's eight microphones; actually this can be implemented directly on the microphone array as you have seen it on the laptop. The spacing is four centimeters, and what we used here was a [inaudible] design above a certain frequency, which was about 1200 hertz here; from then on you have a nearly constant-beamwidth main lobe, but at low frequencies you see the spatial resolution deteriorates drastically. Please note this axis is sensitivity, this is frequency, and this is the angular range. And you see the low selectivity at low frequencies in the conventional case.
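A small sketch of the kind of beampattern being described (my own illustration: a plain delay-and-sum pattern for an 8-element, 4 cm spaced linear array steered to broadside, showing how the main lobe widens at low frequencies):

```python
import numpy as np

c, M, d = 343.0, 8, 0.04                       # speed of sound, mics, spacing [m]
mics = (np.arange(M) - (M - 1) / 2) * d        # element positions on a line
theta = np.radians(np.linspace(-90, 90, 181))  # arrival angle re broadside

def das_pattern(f_hz):
    # Delay-and-sum steered to 0 degrees: uniform weights 1/M.
    steering = np.exp(-2j * np.pi * f_hz * np.outer(np.sin(theta), mics) / c)
    return 20 * np.log10(np.abs(steering.sum(axis=1)) / M + 1e-12)

for f in (500, 1200, 3000):
    bp = das_pattern(f)
    # span of angles where the response is within 3 dB of the broadside maximum
    width = np.degrees(np.ptp(theta[bp > -3.0]))
    print(f, "Hz: approx -3 dB beamwidth", round(width), "degrees")
# Low frequencies give a very wide main lobe, i.e. poor spatial selectivity.
```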
The way to overcome this is often seen as a different way of beamforming, but essentially it's not a different way -- it's just providing different filters to the individual microphone channels -- and then you talk about super-directive beamforming. These are still signal-independent filters, and for the same geometry you see on this heat plot that you can get a much better resolution at low frequencies while still preserving the constant beamwidth around zero degrees here. This example is actually due to the frequency-invariant beamformer design proposed by Power, and the problem here is that in the low-frequency region you get very sensitive to incoherent noise and to calibration errors of the microphones.
That's actually not on the slides here, but we recently developed a method which incorporates constraints on the sensitivity directly into the beamformer design, so that for a given sensitivity you can design the beamformer in one shot.
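A compact sketch of the super-directive design and the sensitivity trade-off (my own illustration: MVDR-style weights against a diffuse-noise coherence matrix, with diagonal loading as a simple stand-in for a white-noise-gain / sensitivity constraint; this is not the one-shot design method mentioned above):

```python
import numpy as np

c, M, d, f = 343.0, 8, 0.04, 500.0
pos = np.arange(M) * d
k = 2 * np.pi * f / c
# Diffuse (spherically isotropic) noise coherence: sin(k*r)/(k*r) between sensors.
dist = np.abs(pos[:, None] - pos[None, :])
Gamma = np.sinc(k * dist / np.pi)              # np.sinc(x) = sin(pi x)/(pi x)
a = np.exp(-1j * k * pos * np.sin(0.0))        # steering vector, broadside look

def sd_weights(mu):
    G = Gamma + mu * np.eye(M)                 # diagonal loading: larger mu -> more robust
    w = np.linalg.solve(G, a)
    return w / (a.conj() @ w)                  # distortionless in the look direction

for mu in (0.0, 1e-2, 1e-1):
    w = sd_weights(mu)
    wng = 10 * np.log10(np.abs(a.conj() @ w) ** 2 / np.real(w.conj() @ w))
    print(f"mu={mu}: white noise gain {wng:.1f} dB")
# mu=0 gives maximum directivity but a very negative white noise gain, i.e. high
# sensitivity to sensor self-noise and mismatch; loading trades directivity for robustness.
```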
The third kind of beamforming I would like to put forward here is actually a robust version of the generalized sidelobe canceler, which we used for speech recognition as well. Actually, it was not mentioned on your slide, so it would be nice to compare the recognition results with yours. As John already mentioned, what you do here -- and you may remember this block diagram -- is constrained beamforming with a mean-square criterion, so it's a second-order-statistics-based criterion in its original version, but you can easily use other optimization criteria, with the problem of then estimating the higher-order statistics. The basic idea is this: you minimize the output power under the constraint that the desired source should not be distorted, and the clever idea of the generalized sidelobe canceler is that the two problems are decomposed, so you have the constraint realized up here and an unconstrained adaptation down here, which makes it really efficient. What's important in the acoustic domain is that you adapt the blocking matrix as well to get good quality. The blocking matrix namely has to provide an output signal which is a good estimate of the noise, and if the desired source moves, it's important that this blocking matrix does not allow any leakage of the desired signal, because the interference canceler, which is placed here, would then cancel out the desired signal as well.
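A schematic sketch of the GSC signal flow (my own minimal version with a fixed delay-and-sum upper path, a fixed pairwise-difference blocking matrix, and an NLMS interference canceler; the robust version discussed in the talk additionally adapts the blocking matrix, which is omitted here):

```python
import numpy as np

def gsc_frame(X, w_ic, mu=0.1, eps=1e-6):
    """One frame of a toy GSC. X: (M, L) time-aligned mic signals (already
    steered toward the desired source). Returns the output block and the
    updated interference-canceler weights w_ic (shape (M-1,))."""
    M, L = X.shape
    d = X.mean(axis=0)             # fixed beamformer: delay-and-sum
    B = X[:-1] - X[1:]             # fixed blocking matrix: adjacent differences
                                   # (removes the time-aligned desired signal)
    y = np.empty(L)
    for n in range(L):
        ref = B[:, n]                                      # noise references
        y[n] = d[n] - w_ic @ ref                           # subtract interference estimate
        w_ic = w_ic + mu * y[n] * ref / (ref @ ref + eps)  # NLMS update
    return y, w_ic

# Toy usage: desired signal identical on all mics, interferer with per-mic gains.
rng = np.random.default_rng(5)
M, L = 4, 20000
s, v = rng.standard_normal(L), rng.standard_normal(L)
X = np.tile(s, (M, 1)) + np.outer(rng.standard_normal(M), v)
y, _ = gsc_frame(X, np.zeros(M - 1))

def resid_db(x, ref):              # power left after projecting out the desired signal
    a = (x @ ref) / (ref @ ref)
    return 10 * np.log10(np.mean((x - a * ref) ** 2))

h = L // 2                         # evaluate after initial convergence
print(round(resid_db(X[0, h:], s[h:]), 1), "dB at a mic vs",
      round(resid_db(y[h:], s[h:]), 1), "dB at the GSC output")
```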
This is actually quite a problem which you need to avoid, so this is why, as I write here, the adaptive blocking matrix is actually quite important. In order to make this converge quite rapidly even in double-talk situations, where both the interferer and the desired source are active, it is desirable to use an adaptation in the DFT domain again, and I will play you an audio sample for this. So first you hear a single microphone signal with two sources; one is the interferer, and it's at an SIR of 3 dB here. And it's actually the [inaudible].
[tape played].
>> Walter Kellermann: So let's listen to the output of the fixed beamformer. And please note that the low frequencies usually cannot be suppressed well, because this is the fixed beamformer like you've seen it a couple of slides ago -- this one, right -- and it does not suppress low-frequency content well. Okay.
[tape played].
>> Walter Kellermann: And now, finally, the output including the effect of the interference canceler, which should reduce the low-frequency interfering noise and the interfering speech as well.
[tape played].
>> Walter Kellermann: Okay. And if you measure this, you get an SIR gain of about 17 dB here. But as you could see already, there is still a challenge here, especially if you go to highly reverberant environments. As John pointed out already, this version of beamforming does not cope completely successfully with reverberation if it does not have enough degrees of freedom by way of many microphones.
So let me move over to the blind part of this talk, where I will focus only on the TRINICON concept which is being developed in my group. First of all I'd like to look at source separation, then briefly at de-reverberation, but also at localization, especially localization of multiple sources and even in multiple dimensions. We'll see in a minute what this means.
So first of all, the blind source separation scenario is given here: you see multiple sources and multiple microphones. We assume that the number of sources and the number of microphones is equal, and we are looking for a demixing matrix here, G_ZX; that's actually the same building block that we had before, but now it's meant for demixing. What you essentially want is to extract the desired sources here, so if you achieved that ideally, it would mean you had inverted this acoustic mixing matrix, and then you would get simply a delayed version of the original signals.
This is not what blind source separation actually does: blind source separation does not do blind deconvolution, rather it only tries to minimize the mutual information between the outputs in the best case. So the strongest criterion that you can fulfill for blind source separation is that the mutual information between the outputs is minimized. TRINICON provides a framework which essentially describes this criterion, the minimum mutual information criterion, and it is especially designed for the non-whiteness, non-stationarity, and non-Gaussianity of the source signals.
And if you look at the cost function for BSS here -- actually TRINICON is more generic and can be used for other purposes like localization and de-reverberation as well -- but if you look at the cost function for BSS, you try to minimize the mutual information between these outputs by looking at multivariate densities of each of these outputs. So here, for example, you have the log of estimated D-dimensional densities for each output: just imagine you take D samples here and you look at the multivariate density of these D output samples of one channel, then you multiply all of these multivariate output densities -- that represents the independence of the individual outputs -- and you contrast this with the joint density over all the output values over all the channels. This is basically the idea.
Of course, if you write this down as a cost function, you don't yet have an algorithm for how to find the best filter coefficients. But this is basically the underlying idea for BSS. And there is some averaging here which allows you to cope with non-stationarity, and also here you have a recursive averaging. So -- yes?
>>: [inaudible]. It seems like you made the assumption that you only have two sources, but it seems that this approach would be super sensitive to noise; if you [inaudible] a little bit of background noise, it would mess this up.
>> Walter Kellermann: I did not mention that estimating these probability densities is of course very difficult in general. We all know that the higher the order of the moment you estimate, the more sensitive it is to noise. Just think of estimating the kurtosis: it's a fourth-order moment, but the variance of that estimate already involves the eighth-order moment, right? So the sensitivity explodes. You can use parametric models, of course, but this is still the generic idea, and whatever models you can put into that will help -- we will come to this. This basically is the starting point, and everything is basically derived from it by specialization. The first and simplest case that you would look at is second-order statistics, which essentially means you assume Gaussian densities. You may know that these do not represent speech signals well, but nevertheless it's a way to start.
And then you know, for example, that these high-dimensional densities can be captured and entirely described by correlation matrices, and that helps. Okay. So if the source model is just a multivariate Gaussian, then you simply have the correlation matrix to consider, and that also leads to a relatively simple adaptation algorithm; the natural gradient, for example, can then be computed from the cost function. So if you reduce the cost function to second-order statistics, you can compute the natural gradient and you have a relatively simple update rule. And if you look at it here, it actually means you have to invert a matrix, very similar to RLS, to recursive least squares. To illustrate what we do here, you see these correlation matrices for the two-channel case. Actually, whatever we said above holds for the P-channel case, but this is for the two-channel case.
So if you don't do anything and you just look at the observed mixtures at the microphones, you would see autocorrelation matrices like this for each channel, and you would see cross-correlation matrices like these off-diagonal matrices. These are of the dimension that you foresee for your demixing filters, so if you foresee demixing filters of length 1,000, then you have 2,000 by 2,000 matrices here.
What this algorithm tries to achieve is to remove the cross-correlation between the channels, which actually means nullifying these off-diagonal matrices. If you want to de-reverberate the signals a little, then you would also remove the outer correlation part of these correlation matrices here, assuming that the inner correlation is due to the speech model and the outer part is due to the room impulse responses. This is what we call partial deconvolution -- multi-channel blind partial deconvolution here. And if you want to completely whiten your output signals, then you would perform a complete deconvolution -- that's multi-channel blind deconvolution -- and then this matrix would be diagonal again.
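A toy illustration of the second-order-statistics idea (my own simplified, instantaneous-mixture stand-in, not the convolutive block-based scheme of the talk: the demixing matrix is found by jointly diagonalizing two correlation matrices from different signal blocks, which exploits the non-stationarity of the sources and nullifies the cross-correlation between the outputs):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 40000
# Two independent sources whose powers change between the two halves (non-stationary).
s = np.vstack([np.r_[0.3 * rng.standard_normal(N // 2), 1.5 * rng.standard_normal(N // 2)],
               np.r_[1.2 * rng.standard_normal(N // 2), 0.4 * rng.standard_normal(N // 2)]])
A = np.array([[1.0, 0.6], [0.5, 1.0]])        # unknown mixing matrix
x = A @ s                                     # observed microphone mixtures

R1 = np.cov(x[:, : N // 2])                   # correlation matrix, first block
R2 = np.cov(x[:, N // 2 :])                   # correlation matrix, second block
_, V = np.linalg.eig(np.linalg.solve(R2, R1)) # generalized eigenvectors diagonalize both
W = np.real(V).T                              # demixing matrix estimate
y = W @ x
# Off-diagonal cross-correlation of the outputs (nearly) vanishes:
print(np.round(np.corrcoef(y), 3))
```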
There has been a real-time implementation since 2004, and we have continuously worked on improving it. It's for P equals 2, two channels, on a PC, and Robert Aichner -- I saw him a while ago, he's gone in the meantime -- is the one who supervised the master's student who implemented it in real time. There is still a lot to be done, and it's no secret anymore that we work on this for hearing aids, where there are heavy constraints; but even without the heavy constraints of that application, reverberation is still a problem, and especially the case where you have more sources than sensors. Think of a cocktail party situation, where you actually have only two hearing aids at two ears, which may be linked binaurally, but you are still not able to deal with 10 sources. So there is still a lot to be done, but there is some progress under way and we will publish on that very soon.
>>: Out of curiosity, have you ever tried to -- have you ever tried to [inaudible] that you
would have, you have a Gaussian circular process [inaudible].
>> Walter Kellermann: Yeah. You will see this in Herbert Buchner's Ph.D. thesis.
>>: [inaudible].
>> Walter Kellermann: Finally I hope. [laughter].
Okay. Anyway, I would like to give you an impression of what we can achieve, for the three-channel case actually. As I said, TRINICON is a relatively generic scheme, and this is actually only one branch, which already makes the assumption that we can describe all the sources by [inaudible], which is actually what you mentioned, by the [inaudible] chi function. So this is a way of creating multivariate densities based on univariate densities and extending that relatively simply. Then you see you have one branch for the multivariate Gaussians and another branch where you have univariate PDFs, and these are basically the branches where you see most of the common BSS techniques aligned. So this is actually a more generic scheme which has these as special cases. And Robert Aichner actually developed a relatively efficient version for multivariate Gaussians, which I want to play here. You will hear the sensor signals first.
[tape played].
>> Walter Kellermann: And now let's listen to the output of the BSS scheme.
[tape played].
>> Walter Kellermann: I guess you could hear the initial convergence phase a bit. Let
me play the other one.
[tape played].
>> Walter Kellermann: The second time I don't play the female voice, but anyway, what you get if you measure the performance is about 19 dB of separation gain, and one of the important features here, which I like to emphasize, is that you basically have no distortion of the desired source. There is a slight coloration, but there is no distortion like you usually have with noise suppression techniques and spectral subtraction.
Okay. So let me move over to de-reverberation. De-reverberation is here depicted as a single-channel problem; we actually look at it in a multi-channel setting, but the main idea here is really that you consider the speech signal as a correlated signal where the first part of the correlation is due to the vocal tract and the rest is due to the room characteristics, which means you would like to preserve these inner diagonals in the correlation matrix and get rid of the outer diagonals. That's a relatively simple idea, but it's relatively hard to put into an efficient adaptive algorithm, because it asks for another constraint in the adaptation. Nevertheless, you can do it, and please note that, as in the blind source separation context, we use filters of length 1,000 at a sampling rate of 16 kilohertz here; we again have two sources and two channels, and the nice thing about this scheme is that it does separation and de-reverberation at the same time. Actually, the separation is even improved relative to the [inaudible] algorithm, and the de-reverberation is also clearly stronger than with, for example, a delay-and-sum beamformer -- though of course, if you only have two mics, a delay-and-sum beamformer will not give you much.
I'm not sure whether you can hear the effect in this room because the original signal is
not too reverberant either but let's try it.
[tape played].
>> Walter Kellermann: Okay. So basically this is how far you get without any extra measures, without any subtraction, post filtering, or equalization of the spectrum. This is really what comes out of the TRINICON algorithm. And of course, if you feel that there is any coloration, you can still equalize that.
Let me look at localization briefly -- I see I'm almost running out of time, but we're getting close. Okay. We can use this blind scheme for localization of multiple sources as well. The idea is again relatively simple, because if you consider this source here and the effort that is made by the demixing matrix in BSS, it means that if you want to separate these two sources from each other and get S1 here, you need to identify filters in these two paths which are essentially equal to these two channels: this acoustic channel should be modeled by this filter, and this one should be modeled by that filter.
So if you then look at these filters and at their main peaks, you will find the direct acoustic path. The peaks of the acoustic paths tell you the relative delay between the two signals, and that is an indication of the direction of arrival, from which you can basically find the source location. That is simple as long as you have only two sources, and it can actually be seen as a generalization of the adaptive eigenvalue decomposition algorithm that Benesty published in '99.
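A minimal sketch of the last step only (my own illustration: converting an estimated relative delay between the two filter peaks into a far-field direction of arrival; the BSS-based filter identification itself is not shown, and the mic spacing and sampling rate are arbitrary example values):

```python
import numpy as np

def doa_from_peaks(h1, h2, fs, mic_dist, c=343.0):
    """h1, h2: identified filters toward one source (as in the BSS demixing
    paths described above). The offset between their main peaks gives the
    TDOA, which maps to a far-field direction of arrival."""
    tdoa = (np.argmax(np.abs(h1)) - np.argmax(np.abs(h2))) / fs
    sin_theta = np.clip(c * tdoa / mic_dist, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Toy check: two impulse-like "filters" whose peaks differ by 3 samples.
fs, d = 16000, 0.2
h1 = np.zeros(64); h1[10] = 1.0
h2 = np.zeros(64); h2[13] = 1.0
print(round(doa_from_peaks(h1, h2, fs, d), 1), "degrees")  # about -18.8 degrees
```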
If you have more than two sources, then you can compute something like directivity patterns at the outputs of the BSS scheme, and the zeros that you will find point to the suppressed sources. So you average over several directivity patterns, and you get a pattern of all these suppressed sources, and these suppressed sources tell you the locations. So that's basically a DOA estimator. And that works in the overdetermined case as well, that is, the case where you have more sources than microphones.
You can also generalize this to multiple dimensions, because usually localization is only seen as a problem in a plane, but of course sometimes you want to be three-dimensional, and then you simply combine several BSS units, maybe one in the vertical plane and one in the horizontal plane. The nice thing about BSS is that at these outputs you have the original, separated source signals. So it's relatively easy to correlate these outputs and, for example, resolve all kinds of ambiguities which you would otherwise get: if you have a set of angles for your desired sources in one plane and another set of angles in the vertical plane, you would not know how to align them, but if you take these signals and correlate them across the two planes, then you find the proper pairs of angles. Nevertheless, there is still a lot to be done, because we are also interested in making this scheme react very fast to short acoustic events -- for example, shots or glasses falling down -- and then you usually don't have enough time to converge to a reliable solution.
>>: Is there some kind of [inaudible].
>> Walter Kellermann: Sorry? Some kind of trade-off?
>>: [inaudible] because if you're doing [inaudible] localization rather than separation [inaudible] filters, then your separation isn't as good.
>> Walter Kellermann: Right. That's basically what we do anyway. I don't talk about the implementation for certain applications, but you would have, at least for localization, as you say, shorter filters, and then you are of course converging faster, but the result is not as reliable as you would like to have it. So of course that's the way to go, but it still does not solve the problem completely, because the reliability suffers. Okay?
So let me finally go to localization of multiple sources in another domain, namely the wave domain, and that brings me to the nice little array that Heinz Teutsch built a while ago, where we basically combine our idea of wave field analysis -- that means describing the acoustic wave field in terms of cylindrical harmonics -- with techniques that are known from array processing, namely the subspace methods. Usually these subspace methods are meant for, and most effective for, narrowband signals. But of course we deal with wideband signals, and fortunately the wave field decomposition gives us a description of the wave field which can be handled as if it were narrowband, so we can apply these methods here. We use again cylindrical harmonics, and this is very close to what Gary Elko proposes as eigenbeams when he uses these arrays for beamforming, but we use it for source localization, and then we use the typical algorithms for subspace localization, that is, MUSIC or ESPRIT -- actually ESPRIT turned out to be much better for our case. This essentially allows us to localize M minus one sources simultaneously, where M is the number of cylindrical harmonics that we consider, and actually we could localize up to five sources with this arrangement.
And this is a very small device, as I said, about 10 centimeters in diameter. Again, this is not complete, because these subspace methods also assume stationary sources, and you have to be quite reliable in estimating your covariance matrices; if you can't do that, then you have to find some improvement of the algorithms to deal with non-stationarity. We are on this road, but we are not yet at the goal, so it's not yet completely solved. And finally, again, reverberant environments are another problem which is not completely solved here. It works in mildly reverberant rooms but not in strongly reverberant rooms. If you put it in here, it would probably work -- this is not a bad environment for that purpose, especially if you put the microphone array at the ceiling -- but if you go to nastier environments like train stations or large halls, then this does not work that well. So far.
So let me conclude. What I wanted to show here is that the plain acoustic human-machine interface still offers a number of major challenges, which can be seen both as signal separation and as system identification tasks. And the reason why this is still a problem? The acoustic environment we deal with is highly complex in terms of the number of degrees of freedom and its time variance, and also, in the natural environment for these natural interfaces, we usually miss the reference signal. So we have many blind problems, and this is one of the main points. And here is the list of what we have been looking at so far: multi-channel acoustic echo cancelation, beamforming, source separation, de-reverberation, and localization.
This basically concludes my talk, but I would like to add to John's statement: reverberant environments really are still one of the many challenges we have, and that's the one we want to solve in the future -- we see it as the next major task. We have been working on that for a while now, but we are still pretty far away from perfect solutions. That concludes my talk. Thank you very much for your attention.
[applause].
>> Ivan Tashev: We have time for a couple of questions.
>>: [inaudible] obviously it's going to go to [inaudible] I just wonder whether this kind of
signal [inaudible] are they going to go to [inaudible] or is there something going on
already?
>> Walter Kellermann: Maybe I didn't get your question.
>>: [inaudible].
>> Walter Kellermann: You mean which kind of processing is going into hearing aids?
>>: Yes. [inaudible].
>> Walter Kellermann: I don't -- I cannot talk too much about that. [laughter].
>> Walter Kellermann: But certainly hearing aids are a very interesting area, and it is highly constrained and does not allow for any imperfections in speech technology. So hearing aid manufacturers will not allow, say, strong noise suppression with artifacts -- that is something you will not be getting there.
>>: [inaudible] the cell phone.
>> Walter Kellermann: The cell phone allows much more, say, artifacts, and whatever is cheap will be bought, right? So it's a bit different. But nevertheless, we are not only focusing on hearing aids, because all the wave field synthesis stuff is obviously not suited for hearing aids. But blind source separation is obviously something which is very well suited to hearing aids, because the blindness does not only mean you are blind to the source positions, you are also blind to the sensor positions. And this is actually a feature that is important for hearing aids, because with hearing aids you will not type in the distance between the microphones first, or, say, the look direction of the small arrays they have behind each ear, right? So these binaural algorithms are actually blind with respect to the sensor positions as well.
>>: So in that case [inaudible] the distance within one single ear with different
microphone, the [inaudible] too small.
>> Walter Kellermann: It's very small. There are a couple of issues. I referred only to two microphones in the blind source separation problem, but of course nowadays hearing aids have up to three mics at each ear, right? And they are very closely spaced; they are highly [inaudible] beamformers, right? So there are a couple of issues in the interaction of those beamformers with blind source separation, for example. But I would say that's a very specific area.
>> Ivan Tashev: More questions? John.
>>: Okay. One comment and one question. So you brought out the Hoshuyama beamformer -- this beamformer developed by your student [inaudible].
>> Walter Kellermann: Yeah, if it were the same, we wouldn't have published it.
>>: No, no, I didn't say it was the same. But we tried it out, and we were sort of embarrassed because we couldn't publish the results, because they weren't better than the MMSE beamformer. [inaudible].
>> Walter Kellermann: That depends on how you implement it, and MMSE is a flexible criterion. If you put the constraint into it, then it's different.
>>: [inaudible] too.
>> Walter Kellermann: MMSE is a different criterion. So if you take a plain MMSE beamformer, you don't have the constraint in it, right?
>>: Right. Okay. The question was: you made the comment that de-reverberation is different from speech separation, right, or source separation. But there's this result from the ICA world, and they show that essentially [inaudible] is related to mutual information.
>> Walter Kellermann: Yeah, that's no doubt about that.
>>: If you minimize the mutual information, you're essentially maximizing the [inaudible].
>> Walter Kellermann: Right but.
>>: You saw from my results that reverberation increases negentropy, or, sorry,
decreases negentropy.
>> Walter Kellermann: Yeah. But this is a byproduct, I would say. It actually -- let's see.
>>: So my question is, doesn't the distinction between de-reverberation and source separation go away if you consider non-Gaussian statistics?
>> Walter Kellermann: No. If you look at this here, this is the BSS cost function, right? It describes mutual independence between the outputs, right? Now, if you want to de-reverberate, then you have to put into these estimated densities a source model for your signal and a model for your reverberation in each channel; then you can describe this as, say, a multivariate density with a separate term for the source model and a separate term for the impulse response, right?
>>: Maybe we should talk about this later on, but there's a result from the ICA world that shows that if you enforce the constraint that the final sources -- the sources that you separate -- must be decorrelated, then maximizing negentropy is equivalent to [inaudible].
>> Walter Kellermann: That's also for scalar mixtures. We are not talking about scalar mixtures yet.
>>: [inaudible] showed demonstrated that when you reverberate a signal it becomes
more Gaussian.
[brief talking over].
>> Walter Kellermann: But this is the case if I reduce my dimension D here to 1, right? That's a special case -- if you go here to 1, then you're fine. But this is explicitly not done here, because we know that our outputs are correlated signals and we treat them in the time domain, right? The cost function is derived in the time domain. The implementation is done in the frequency domain -- we don't care, right? We can do anything in the DFT domain as long as we are careful. But you have to be careful, of course. Nevertheless, the cost function is on the time-domain samples, and then you go to the frequency domain for implementation purposes. And this is a very, very fundamental part of this concept. Unlike all these -- I'm getting nasty now, I'm sorry.
[laughter].
See, here is the unconstrained DFT-domain BSS world, and here is a block which I see as a dark hole, where you do all kinds of repair mechanisms, for example to resolve the ambiguity problem, and this is [inaudible].
>>: Walter, we can [inaudible].
>>: [inaudible].
>> Walter Kellermann: I don't want to discard these algorithms in principle, but they all set D equal to one, and then they need something to realign the frequency bins, right? That can sometimes bring you back to a model which is essentially in line with starting from univariate PDFs and so on. So you can go back a little bit. But many of the algorithms are really using repair mechanisms, correlations between frequency bins, and so on. Right?
>> Ivan Tashev: I'm sorry, we have no time for more questions. [laughter]
The next talk is about to start, and Professor Kellermann is already late for the airport.
Let's thank him.
[applause].