>> Ivan Tashev: Good morning, everyone. For me it's a great pleasure to introduce the speaker of the second talk of our mini seminar today, Professor Walter Kellermann. I don't think I have to give a long introduction here. He is well known as a valuable contributor to signal processing.
We know that he is currently chairman of the Technical Committee on Audio and Electroacoustics of the IEEE Signal Processing Society, and he is an IEEE Distinguished Lecturer. But for many of us, of course, he is simply Walter Kellermann, who constantly brings new algorithms and new ideas, improves this or that particular algorithm in acoustic echo cancelation, noise suppression, and microphone arrays, and is a constant participant of the conferences he loves to take part in.
So, without wasting too much time, Professor Kellermann has the floor.
>> Walter Kellermann: Thank you, Ivan, for this very nice introduction. As all of you know, the main work is usually done by the students, and this is the list of students I want to recognize first before I start my talk. Some of you will know Robert Aichner, and he promised to join us today. He has been with Microsoft for about a year. We are working on acoustic signal processing, which is the first line of the title. The next one is of course a buzzword. But the main thing is of course the multi-channel aspect of what we want to look at in this talk.
I would like to start with a brief introduction, and I apologize already that this talk does not go too much into details. I would like to focus on the general scope of this area and also, of course, on some specific solutions, in order to make you believe that at least something works.
So the goal is pretty simple: an ideal, seamless human-machine interface, which means the users here in an acoustic environment should be untethered and mobile, and we foresee a number of loudspeakers and a number of microphones to do some signal processing. The signal processing should fulfill the following tasks: first, it should reproduce signals for the listeners at their ears in the way they like it, or the way we would like them to listen to it. The other tasks are basically to capture source signals and to determine the source positions. So you see already we have a reproduction problem and we have a signal acquisition problem, and the fact that it is a problem at all is due to the acoustic environment here, which creates feedback between the loudspeakers and the microphones. We have reverberation, and reverberation comes in two flavors here. Usually -- and John McDonald alluded to that already -- we have the reverberation when picking up the microphone signals, and the second flavor is the reverberation that we hear when we are listening to loudspeaker signals. Basically this is something which we mostly ignore, but I will come to that later; it is a perceptual thing that needs to be addressed in a reproduction system, especially if it is high quality. And finally we have, of course, noise and interference, and we want to address all three of these problems in the following. One of the real problems is that in many scenarios you have all of these problems together, simultaneously, so the signal processing needs to look at them simultaneously.
You all know many of these applications. I categorize them into two classes. One is basically hands-free equipment for telecommunication and what I call here natural human-machine interaction, which of course includes speech recognition. You find these in mobile phones, in cars, in computers, but also in telepresence systems, and this is an area which I especially like because you may have more money to spend on microphone and loudspeaker arrays, which is not the case, for example, in cars and mobile phones. And then of course there are all these smart applications like smart meeting rooms and smart homes, and sometimes you look at museums and exhibitions as well.
The second class is audio communication, and this means here we look at equipment for stages and for recording studios, and also virtual acoustic environments, for example virtual concert halls or tele-teaching studios. With regard to the tele-teaching studios, you may think of musicians practicing together in two different places. This is a very nice problem; the requirements there are pretty tough, and that's something to look at in the future which is very challenging.
And of course there's a more secretive application, which is surveillance, and as you can see, I don't put too many subitems here, for the same reason. Okay. Let me show you a few microphone arrays which we built. Actually, when I joined the university in Erlangen we started with a project on microphone arrays for laptops, with Intel at that time, and here you see a microphone array of eight mics which will be used for beamforming in the following and to illustrate the beamforming algorithms in the sequel as well. Then Robert Aichner worked in a European project where the environment got a little more challenging, namely hands-free communication for emergency cars like fire brigade cars, where there's a lot of noise and a lot of stress on the speakers. Obviously this is only a reference microphone, so it is not to be used in practical applications; the actual microphones are here.
We use microphone arrays also for interactive TV applications now, in a recent European project entitled DICIT, and what you see here is a large array with logarithmically spaced sensors; the whole array is about 1.5 meters long. So that's for large-screen TVs. And here you see another application for localization, actually in a parking garage. I hope the image is not too dark for you, so that you can see the array here and here. This is meant for localization purposes in highly reverberant environments like parking garages.
Another application of surveillance would be public spaces, and here you see another microphone array with a total of six sensors, and here we have a very small one which is also used for multiple source localization. It's only about ten centimeters in diameter.
>>: [inaudible]. That public space, how [inaudible].
>> Walter Kellermann: Basically you can use it for surveillance purposes, so that you can immediately steer a camera to where somebody smashes a window, for example. It's mostly for bad guys, right. And of course I only show the civil applications here. Okay.
Then I always like to show a picture of a microphone array which I did not build. That's the one by Gary Elko here. And I always use this in order to illustrate what also could be done but has been forgotten over the past years.
This is an array which is basically meant for communication between several auditoria, and this was at the high times of [inaudible], when they built this 360-element array for picking up the sound in the auditorium and being able to broadcast questions to the other auditorium as well. So you did not need a person to run around and hold a microphone in front of the people asking questions; you could use this microphone array. That was actually back in the '80s, and during my Ph.D. time, more or less exactly 20 years ago, we digitized it by putting a DSP right behind each sensor. It was one of the first-generation floating-point DSPs.
Some of you may know on the reproduction side -- yeah?
>>: [inaudible] array [inaudible].
>> Walter Kellermann: Originally it was actually full audio bandwidth, but we only used telephone bandwidth for evaluation. So it's eight kilohertz.
>>: [inaudible].
>> Walter Kellermann: In that, yeah.
>>: [inaudible].
>> Walter Kellermann: Oh, that was a lot of data actually. Yeah. We had cubic meters
of hardware. You may ask Gary for the details.
Okay. Then on the reproduction side, we have been looking into wave field synthesis for a few years now, and actually this work is mostly done by Rudolf Rabenstein, who is also with our chair. He is basically trying to recreate acoustic wave fields not just in a sweet spot but in an extended area, which for example would mean here in this area enclosed by this rectangle. But you can also have these loudspeaker arrays in that fashion. Basically you use these panels, and each of these panels is essentially an eight-channel loudspeaker, so you have eight exciters behind each panel. This was supported by the European Union within the CARROUSO project a while ago.
This was just for introduction. Now let me move over to the more generic stuff in terms of equations. I would like to look at the fundamental problems that we face here for reproduction and acquisition, and then I would like to show some advances in our group, mainly in the area of signal acquisition, and that looks at multi-channel acoustic echo cancelation and beamforming -- I have to apologize for using beamforming. But we also look at blind signal processing, and then you may think of all your higher-order statistics. We published a while ago a concept which we think is quite generic and which we entitled TRINICON; it can be used for source separation, de-reverberation, and source localization, and actually you can see it as a more general theory incorporating supervised filtering -- like adaptive filtering for echo cancelation and beamforming -- as well.
And finally I will briefly look at source localization in the wave domain, and then I will come back to one of the examples of microphone arrays that you have seen already. So let me first start with the way we describe these problems. Basically, what we assume is that we want to do linear signal processing involving the two kinds of signals that are of interest to us: we have a set of reproduction channels, U, and we have a set of signals, Z, that we want to extract. If you think of the inputs and outputs of this digital signal processing unit here, you have the loudspeaker signals as one set of outputs and the retrieved signals as the other set of outputs, so we put those two into the output vector. As this should all be linear signal processing, we assume that the set of reproduction signals U and the set of microphone signals X are processed as inputs, and this G captures all the linear signal processing, which is of course multiple-input, multiple-output. Then you can look at this G in terms of submatrices connecting these inputs -- the reproduction channels and the microphone signals -- to the outputs individually, and so you have these four submatrices, which we will use in the following to describe the problem.
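As a rough illustration of this block structure (my own sketch added for clarity, not from the talk: the names U, X, V, Z and the four submatrices follow the notation above, and a single frequency bin with memoryless matrices stands in for the full convolutive MIMO system):

```python
import numpy as np

# One frequency bin, purely illustrative dimensions.
rng = np.random.default_rng(0)
M_rep, N_mic, K_ls, Q_out = 2, 4, 8, 2  # reproduction ch., mics, loudspeakers, extracted signals

G_VU = rng.standard_normal((K_ls, M_rep))   # reproduction channels -> loudspeakers
G_VX = rng.standard_normal((K_ls, N_mic))   # microphones -> loudspeakers (interference compensation)
G_ZU = rng.standard_normal((Q_out, M_rep))  # reproduction channels -> extracted signals (echo cancelation)
G_ZX = rng.standard_normal((Q_out, N_mic))  # microphones -> extracted signals (separation, noise reduction)

# Full linear processing matrix G maps the stacked inputs [u; x] to the outputs [v; z].
G = np.block([[G_VU, G_VX],
              [G_ZU, G_ZX]])
u = rng.standard_normal(M_rep)   # reproduction channel signals
x = rng.standard_normal(N_mic)   # microphone signals
out = G @ np.concatenate([u, x])
v, z = out[:K_ls], out[K_ls:]
print(v.shape, z.shape)          # (8,) loudspeaker signals, (2,) extracted signals
```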
Then the goal is essentially to provide the desired listener signals at the ears here, and if you describe this in terms of linear systems theory, you see you have the convolution of the loudspeaker signals with the acoustic environment -- this [inaudible] system described by this matrix here -- and you have the additional noise. This W of course captures two signals per listener in our case. On the other hand, we have the microphone signals X, which are basically a mixture of the source signals convolved with the acoustic environment, the impulse responses again, plus the acoustic echo resulting from the loudspeakers transmitted via the acoustic environment, plus the noise vector n_X -- that's for the noise. One of the reasons why this is a problem at all is that the acoustics can be quite nasty. All the elements in our matrices H here are basically impulse responses, and usually you characterize them by the reverberation time. This is the time the sound needs to decay by 60 dB, referred to as T60, and this is approximately 50 milliseconds in cars, while in concert halls it reaches up to one or two seconds.
Whenever you want to model these impulse responses, you usually take FIR filters with a number of coefficients that is typically T60 times the sampling frequency divided by 3; that captures enough energy so that the model error stays roughly 20 dB below the signal energy.
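As a small worked example of that rule of thumb (my own illustration, not from the talk's slides; the room names and T60 values are just typical figures):

```python
# FIR length needed to model a room impulse response down to roughly -20 dB
# model error, using the rule of thumb N ~ T60 * fs / 3 mentioned above.
def rir_model_length(t60_s, fs_hz):
    return int(round(t60_s * fs_hz / 3))

fs = 12000
for name, t60 in [("car", 0.05), ("office", 0.3), ("living room", 0.5), ("concert hall", 1.5)]:
    print(name, rir_model_length(t60, fs), "taps at 12 kHz")
# e.g. office: 0.3 * 12000 / 3 = 1200 taps -- consistent with the
# "1,000 or more coefficients" quoted later for a conference room.
```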
These models are inherently non-minimum phase, and if you look at typical impulse responses here and at their pole-zero diagrams, you see most of the zeros are actually on the unit circle. In this case it's an impulse response -- we call it 'inner office' -- with a T60 of about 300 milliseconds at a sampling rate of 12 kilohertz; this is the impulse response, and you have actually thousands of zeros on or very close to the unit circle. That tells you something about how to deal with this if you want, for example, to de-reverberate or to equalize it.
>>: [inaudible].
>> Walter Kellermann: Yes?
>>: [inaudible].
>> Walter Kellermann: You have a flat delay from the source to the microphone. That is
not minimum phase.
>>: So it's not minimum phase?
>> Walter Kellermann: Non minimum. Right. It's not -- it's non minimum phase. Any flat
delay is non minimum phase.
Okay. So if we look now at the problems that we have to solve, given these acoustic environments, then in reproduction we of course want to look at the desired signals at the ears. Basically what we want to do is create a set of desired signals, w_D here, which is related only to the reproduction channels by some matrix H_D, and this is basically the desired transfer matrix that we have to provide in the end. If we equate this to the actual path from the reproduction channels to the ears, then we essentially have to equate this H_D to this term here, and you see that in order to provide this you have to deal with the noise: you still have the noise at the ears, and as a handle on that, you can use the microphone signals. And as a second component you see that the path from U must equal the desired path, and in reality this is the acoustic path together with the matrix with which you have to condition the signals to be reproduced, this G_VU.
So we have essentially two subproblems. One is an equalization problem, which you may look at as a deconvolution problem: this unit must equalize the acoustic path in order to achieve the desired characteristic. The second problem is an interference compensation problem, which basically means you have to use the microphone signals and create an estimate of the noise at the ears that can be fed into the loudspeakers, and of course this also has to account for the acoustic path.
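As a rough numerical illustration of the equalization subproblem only (my own sketch, not from the talk: the names H_ears, H_des, G_VU follow the notation above, it works in a single frequency bin, and it ignores the interference-compensation term and all blind-identification issues):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 8, 2                      # loudspeakers, reproduction channels
# H_ears: 2 x K transfer from loudspeakers to the two ears (one frequency bin).
H_ears = rng.standard_normal((2, K)) + 1j * rng.standard_normal((2, K))
H_des = np.eye(2, M)             # desired 2 x M path, e.g. left channel to left ear

# Regularized least-squares choice of G_VU so that H_ears @ G_VU ~ H_des.
lam = 1e-3
G_VU = np.linalg.solve(H_ears.conj().T @ H_ears + lam * np.eye(K),
                       H_ears.conj().T @ H_des)
print(np.abs(H_ears @ G_VU - H_des).max())   # small residual in this bin
```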
Obviously this is not so simple, because you have to blindly identify the acoustics here between the loudspeakers and the microphones, and you also have to identify the noise component at the ears, which is also non-trivial. This is a real challenge, and all the active noise cancelation approaches we know so far always assume that they can measure the noise close to the ears. But in practice, in our desired environment, this is not possible, because we want to be seamless and people should not need to wear any devices close to their ears.
Okay. Let me briefly look at the state of the art in reproduction, just in terms of this description. If you look at conventional stereo and multi-channel systems, there is nothing like interference compensation, and equalization does not consider the real acoustics either; it only considers something which you may like as a listener -- you boost the bass or change the amplification in various frequency bands, so you have frequency-selective gains, but no accounting for the acoustics.
If you look at beamforming with loudspeakers, that can be done either in a more conventional way or with super-directive beamforming, also ultrasound-based, but still there is no interference compensation so far, and also there is often only a coarse room equalization for a sweet spot, but not for a whole extended area where people might want to move while listening.
What can be done, and is actually done in some products already, is to use impulse responses for the desired characteristic, so there is a rough equalization already, but there is no inversion or equalization of the room characteristics directly.
The furthest we get so far is wave field synthesis, and there are first attempts at interference compensation. Actually, Achim Kuntz started to work on this in 2004, but most of the time, in practical environments, it is still not possible to follow, for example, the time variance of the acoustic environment. On the other hand, there are some results on equalization, but also only for time-invariant room acoustics, and Sascha Spors has been working on that for about four years now. But most of the time, even in wave field synthesis, this is just a set of impulse responses that are used for the reproduction channels in order to mimic the sound of artificial rooms, without accounting for the real listening environment.
So you can listen to sounds as in churches, but you don't really invert the local listening acoustics. Okay. There are quite a few challenges involved with that. One is the equalization and the other is the interference compensation, and both, if you want to do them perfectly, require blind identification of these acoustics. And this is something which is still pretty far out of reach, I would say. But it's of course a nice challenge for academia. One problem, however, is that many people argue we may not actually need that. We don't need that much spatial realism; we may not really need to solve the whole identification problem, which is highly sensitive -- and we have published papers on how sensitive it is. So it's really questionable how much spatial realism we can really appreciate. This is one of the questions that is still open in this area, and it may be that this is also one of the reasons why not so much effort is devoted to this.
>>: [inaudible]. Even if you can identify the system perfectly, though, if it's non-minimum phase you can't invert it, right?
>> Walter Kellermann: Right. You have to have an estimate for that. But as listeners are usually not really sensitive to all-pass filtering, and you know any filter can be decomposed into an all-pass and a minimum-phase system, it would be sufficient to equalize the minimum-phase part. So you cannot perfectly identify that in many cases, right? But if you have a good guess for the all-pass part, that might be sufficient.
Okay. Let me turn to the signal acquisition part, and this is where we are quite sure that it's reasonable to identify the systems we need to identify. Here the problem is that we essentially want to extract a set of desired signals out of this acoustic scene, and they should be undistorted in the sense that they should sound as if you had put a microphone very close to the mouth here and removed all other effects. If you compare this to what you really get, you of course have these microphone signals which, as I said, are a mixture of the source components, the acoustic echo, and the noise, and this can now be treated by using your other inputs, namely the reproduction channels. So you have these two submatrices, G_ZU and G_ZX, one acting on the reproduction channels and one on the microphone signals, in order to get what you actually want. This results in essentially three subproblems. One is the acoustic echo cancelation problem, where you essentially need to model this acoustic path, including those matrices here, by this matrix; once you equate those two paths, the acoustic echo is gone, which is basically what you want: you want to remove all components that are related to your reproduction channels. The second problem is the source separation or de-reverberation problem, which essentially treats the components resulting from the desired sources. What you get at the microphones is the mixture of all the desired sources, each convolved with the acoustic path impulse response. You can treat that with the matrix G_ZX, and the result should basically be a flat delay of the desired sources. And finally you would like to suppress all the interfering noises, which essentially means that this matrix should nullify all the noise components. This looks very simple, but of course G_ZX also has to fulfill this condition and this one, so the problem is that this G_ZX needs to satisfy all of these requirements at the same time. And one of the problems that you don't see so easily is that, first of all, you would have to separate all these various components in the microphone signals, and this is also to be done by G_ZX.
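A toy single-frequency-bin illustration of these conditions (my own sketch, not the talk's algorithm: it assumes the echo paths H_e and source paths H_s were known, which in reality they are not, and it ignores noise suppression):

```python
import numpy as np

# Mic vector: x = H_s @ s + H_e @ u + n ; extracted signals: z = G_ZU @ u + G_ZX @ x.
rng = np.random.default_rng(1)
P, K, N = 2, 2, 4                     # sources, loudspeakers, microphones
H_s = rng.standard_normal((N, P))     # source -> mic paths (one bin, real-valued toy)
H_e = rng.standard_normal((N, K))     # loudspeaker -> mic echo paths

G_ZX = np.linalg.pinv(H_s)            # here: also (ideally) separates the sources
G_ZU = -G_ZX @ H_e                    # echo-cancelation condition: removes everything driven by u

s = rng.standard_normal(P); u = rng.standard_normal(K); n = 1e-6 * rng.standard_normal(N)
x = H_s @ s + H_e @ u + n
z = G_ZU @ u + G_ZX @ x
print(np.allclose(z, s, atol=1e-4))   # echo removed, sources recovered (noise-free limit)
```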
Okay. Now, after talking so much about problems, I would like to come to a few solutions. One is multi-channel acoustic echo cancelation, and here we start with the isolated single-channel case in order to define the problem again. Essentially this has been a textbook problem for 25 years; it has been recognized as such, and what you have to do is find a filter which mimics the impulse response from the loudspeaker, through the acoustics, to the microphone, in order to remove the components of U. As this acoustic environment is time-variant, you do this, as you know, with an adaptive filter which usually approximates the Wiener filter solution. Sorry for using second-order statistics here, but this is quite appropriate and has been working for many years; actually, we also use some higher-order statistics in robust filtering as well.
But for now it's sufficient to assume we have adaptive filters that approximate the Wiener solution. And why is this a problem at all? Why do people talk about it -- and I've seen a talk by Microsoft people at IWAENC on the problems of implementing this? It's a long filter: typically, to get 20 dB echo suppression in a conference room at a 12 kilohertz sampling rate still asks for 1,000 or more coefficients. In a living room it's easily 4,000 coefficients you would like to realize, simply in order to mimic or model this acoustic path.
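As a minimal sketch of the kind of adaptive filter meant here (my own illustration using plain time-domain NLMS on a synthetic decaying "room"; the systems in the talk use frequency-domain and robust variants instead):

```python
import numpy as np

rng = np.random.default_rng(2)
fs, L = 12000, 1000                  # sampling rate, modeled filter length
h = rng.standard_normal(L) * np.exp(-np.arange(L) / 200.0)   # toy decaying room response
u = rng.standard_normal(5 * fs)      # loudspeaker (far-end) signal
d = np.convolve(u, h)[: len(u)] + 1e-3 * rng.standard_normal(len(u))  # microphone signal

w = np.zeros(L)                      # adaptive estimate of the echo path
mu, eps = 0.5, 1e-6
for n in range(L, len(u)):
    x = u[n - L + 1 : n + 1][::-1]   # current state vector of the FIR filter
    e = d[n] - w @ x                 # echo-canceled (error) signal
    w += mu * e * x / (x @ x + eps)  # NLMS update
print("misalignment [dB]:", 20 * np.log10(np.linalg.norm(w - h) / np.linalg.norm(h)))
```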
The multi-channel case is a little more complex, for two reasons. The first is that, obviously, if you have K loudspeakers, you have K times more filter coefficients to adapt, which makes the problem computationally more expensive. But the real problem for academia is of course that the correlation matrix here, which is decisive for the adaptation speed and the adaptation behavior of the adaptive filters, is worse conditioned than in the single-channel case. I tried to illustrate this here by this matrix: in the single-channel case you see a diagonally dominant matrix, and this matrix is the autocorrelation matrix of the state variables of the FIR filter.
It is usually at least diagonally dominant, because speech also has a decaying correlation over time, so it would look like this. If you have, say, three channels, these channels are usually mutually correlated -- as you can easily imagine for stereo or 5.1, the individual channels are strongly correlated. Then the autocorrelation matrix of the set of all input vectors for all these filters looks more or less like this, and if you just [inaudible], there's a lot of regularity, and you find columns and rows which are very similar, because the diagonals in the off-diagonal blocks are still similar to the main diagonal, so this matrix is usually pretty ill-conditioned. Regarding the cross-correlation between the channels, there are a couple of solutions to this problem, but basically only bad solutions. Whatever you do, it somehow affects your loudspeaker signals, because correlation is something you cannot remove by linear, time-invariant processing. So you have choices: either you add some non-linearity here, and of course this must be inaudible -- and unfortunately, the more inaudible it is, the less it helps, right?
But that's quite natural. Or you can add noise, along the same lines: you should keep it below the masking threshold of the ear, so that people buying very expensive audio equipment, a $20,000 loudspeaker setup, do not hear any noise, right? But it still must be strong enough to de-correlate the signals, which is a dilemma, right?
The third option, and we think this is the best solution, is to introduce time-varying all-pass filtering, tuned such that the perceived sources don't move in space. So you have different all-passes here, they are time-varying, and actually they slightly move the sources: in a stereo reproduction you would think that if you change the phases of the signals the sources start to move. They actually do, but only to an extent that you cannot perceive -- your binaural hearing does not resolve it that finely. This was actually done together with the audio coding people, so you see Uden Harris [phonetic] named here, who was an intern with us at that time, and we developed this method together and published it at ICASSP 2007 or 2008. This works actually quite well. So this is one way of playing dirty tricks to resolve this correlation problem.
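A minimal sketch of the time-varying all-pass idea (my own toy version, not the published design: a first-order all-pass per channel whose coefficient is slowly and independently modulated, so the phase of each channel drifts a little while the magnitude spectrum is untouched):

```python
import numpy as np

def tv_allpass(x, a_center=0.0, a_dev=0.05, mod_hz=0.5, fs=16000):
    """First-order all-pass y[n] = a[n]*x[n] + x[n-1] - a[n]*y[n-1]
    with a slowly time-varying coefficient a[n]."""
    n = np.arange(len(x))
    a = a_center + a_dev * np.sin(2 * np.pi * mod_hz * n / fs)
    y = np.zeros_like(x)
    x_prev = y_prev = 0.0
    for i in range(len(x)):
        y[i] = a[i] * x[i] + x_prev - a[i] * y_prev
        x_prev, y_prev = x[i], y[i]
    return y

# Differently modulated all-passes on the two stereo channels reduce their
# mutual coherence; the modulation depth must stay perceptually inaudible.
rng = np.random.default_rng(3)
left = rng.standard_normal(16000)
right = left.copy()                        # fully correlated stereo toy signal
left_d = tv_allpass(left, mod_hz=0.7)
right_d = tv_allpass(right, mod_hz=1.1)
print(np.corrcoef(left_d, right_d)[0, 1])  # slightly below 1: no longer perfectly coherent
```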
And here is a solution, or at least one solution, to the complexity issue: you essentially go to the DFT domain, and in this case that results in the fact that, rather than inverting one huge matrix, you only invert many matrices of small size. Think of five channels and 4,000 coefficients in each channel: instead of inverting a 20,000 by 20,000 matrix, you rather invert 4,000 matrices of size 5 by 5. And this is essentially what you do in real time.
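A rough sketch of that DFT-domain argument (my own toy batch least-squares version, not the adaptive GFDAF algorithm: per bin you accumulate a small K x K loudspeaker cross-power matrix and solve it, instead of one huge time-domain system):

```python
import numpy as np

rng = np.random.default_rng(4)
K, B, T = 5, 256, 400            # channels, DFT bins, number of signal blocks
# Toy per-bin echo paths from K loudspeakers to one microphone.
H = rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))

S_uu = np.zeros((B, K, K), dtype=complex)   # K x K cross-power matrix per bin
s_ud = np.zeros((B, K), dtype=complex)      # cross-spectrum with the mic signal
for _ in range(T):
    U = rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))  # loudspeaker spectra
    D = np.einsum("bk,bk->b", H, U)                                     # mic spectrum = echo sum
    S_uu += np.einsum("bk,bl->bkl", U.conj(), U)
    s_ud += U.conj() * D[:, None]

# Per bin: solve a small K x K system instead of one (K*L) x (K*L) time-domain system.
W = np.linalg.solve(S_uu + 1e-6 * np.eye(K), s_ud[..., None])[..., 0]
print(np.abs(W - H).max())        # per-bin echo path estimates match the true paths
```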
So if you look at the results, you see that you can obtain reasonable convergence, much better of course than with LMS. Here is a convergence curve for the system error norm, which tells you how well we identify the real impulse responses, and you get reasonable convergence with this generalized frequency-domain adaptive filtering which Herbert Buchner and others proposed. This is here for two to five channels; over time you see the convergence is pretty slow, because this axis is in seconds. What you really hear, though, is the echo return loss enhancement, and this is the suppression of the actual echo, which converges much faster, to a level of about 20 dB in two seconds.
Now, needless to say, we claim that the reproduction quality with our time-varying filters is still much better than with the others, and we actually verified that by [inaudible] tests, but I would like to play an example here. I hope the audio works now.
So this is a microphone signal, and it was actually recorded for a distant speech recognition task. Some of you who know German will identify that.
[tape played].
>> Walter Kellermann: This is microphone signal.
[tape played].
>>: [inaudible].
>> Walter Kellermann: That's not you?
>>: No, it's not me.
>> Walter Kellermann: Pardon? No, no, it's Herbert Buchner.
[tape played].
>> Walter Kellermann: Sorry. Okay. That's basically the effect; I hope you could hear the convergence in the beginning, and you easily get down to a substantial amount of echo suppression -- echo cancelation. There's no post filter involved; for those of you who know post filtering, you can do even better, but for this application it's still a good [inaudible]. Pardon?
>>: [inaudible].
>> Walter Kellermann: There's no suppression. It's just the filter -- just the five-channel filter. So you could even do better, of course, but you would not hear that in this application.
>>: [inaudible].
>> Walter Kellermann: Sorry?
>>: [inaudible].
>> Walter Kellermann: [inaudible]. This is simply a perceptual test where you take, in this case, about a dozen people to listen, and you have hidden anchors, so you have ideal reference signals, and then they all rank relative to that. Okay. There is a real-time implementation running on a PC, and this is actually quite old already; it was about five years ago when we first implemented that on a PC. Nowadays PCs are much faster, but even at that time we could already realize 25,000 coefficients.
So what we are doing right now is a little more challenging: it is going from five channels to more, and in this case you see a loudspeaker array of 48 channels for wave field synthesis. Herbert Buchner actually published a generic concept for wave-domain adaptive filtering in 2004, which essentially transforms the problem into another domain, namely the wave domain. The idea is that you do not identify the individual channels from each loudspeaker to each microphone, but you decompose the wave field which is generated by the wave field synthesis into harmonics -- in this case cylindrical harmonics -- so in the end these are essentially spatial basis functions.
This transform comes first, and then you go to the DFT domain, and basically, in a setup with 48 loudspeakers and 48 microphones, rather than identifying 48 squared adaptive filters for all the elements in this acoustic matrix, you only need to identify K -- in this case 48 -- acoustic paths corresponding to the wave field decomposition in terms of the cylindrical harmonics.
So essentially you bring the computational complexity down from K squared to K -- actually a little more than K -- and that works quite well. We had papers at an early stage with simulated data, but we are now about to implement this in real-time hardware as well.
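A minimal sketch of the wave-field decomposition idea for a circular array (my own simplification, not the talk's transform: for a uniform circular array, the circular/cylindrical harmonic components at one frequency can be approximated by a spatial DFT over the sensor angles; the radial terms of the full wave-domain transform are omitted):

```python
import numpy as np

def circular_harmonics(p_mics, max_order):
    """p_mics: complex sound pressure at M equally spaced mics on a circle
    (one frequency bin). Returns harmonic coefficients for orders
    -max_order..max_order via a spatial DFT over the mic angles."""
    M = len(p_mics)
    phi = 2 * np.pi * np.arange(M) / M
    orders = np.arange(-max_order, max_order + 1)
    # coefficient of order m ~ (1/M) * sum_k p(phi_k) * exp(-j*m*phi_k)
    return (np.exp(-1j * np.outer(orders, phi)) @ p_mics) / M, orders

# Toy check: a field that is exactly one harmonic maps to a single coefficient.
M, m_true = 48, 3
phi = 2 * np.pi * np.arange(M) / M
p = np.exp(1j * m_true * phi)
coeffs, orders = circular_harmonics(p, max_order=5)
print(orders[np.argmax(np.abs(coeffs))])   # -> 3
```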
Nevertheless, this is an ideal environment in the sense that the geometry of the loudspeaker array is quite simple and nice to handle, because then the cylindrical harmonics tell you how to implement this transform. The real challenge is to go to, say, listening spaces like your living room at home, where you don't want a cylindrical or circular array -- you want to build the loudspeakers into the walls -- and then you need other transforms. And that makes it really hard. So that's still unsolved, but it's something to look forward to.
Then the second topic -- and I hope I don't run out of time -- is beamforming, which we use for signal extraction and interference suppression, and I have to say we are not explicitly targeting speech recognition applications here, but we did use it for speech recognition as well. Basically, if you look at it, it's simply a signal separation task: you have this box here and you would like to separate those signals, and that means you can use three domains -- time, frequency, or space. Of course, space requires more than one sensor here.
If the origin of the signal is known, be it in time, frequency, or space, then the separability is only determined by the aperture width, basically the time-frequency resolution or the spatial resolution; that's what determines how well you can separate the signals. If the origin of the signal is unknown in these three domains, then you can talk about blind signal separation. So, reducing this to the spatial dimension, you have supervised beamforming for the case that the spatial origin is known, and for the case that the spatial origin is unknown you have blind beamforming, or blind source separation. Actually both terms have been used; blind source separation sounds more fancy, and then most people forget that it's essentially blind beamforming.
Let me start briefly with supervised beamforming. What we see here is just an example of a signal-independent filter-and-sum beamformer in the conventional way, which should just be understood as an example of what you want to achieve. It looks at zero degrees, so that's the steering angle, and it's eight microphones; actually this can be implemented directly on the microphone array as you have seen it on the laptop. The spacing is four centimeters, and what we used here was a [inaudible] design above a certain frequency, which was about 1200 hertz here; from then on you have a nearly constant-beamwidth main lobe, but at low frequencies you see the spatial resolution deteriorates drastically. Please note this axis is sensitivity, this is frequency, and this is the angular range. And you see the low selectivity at low frequencies in the conventional case.
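A small sketch of the kind of beampattern being described (my own illustration: a plain delay-and-sum pattern for an 8-element, 4 cm spaced linear array steered to broadside, showing how the main lobe widens at low frequencies):

```python
import numpy as np

c, M, d = 343.0, 8, 0.04                       # speed of sound, mics, spacing [m]
mics = (np.arange(M) - (M - 1) / 2) * d        # element positions on a line
theta = np.radians(np.linspace(-90, 90, 181))  # arrival angle re broadside

def das_pattern(f_hz):
    # Delay-and-sum steered to 0 degrees: uniform weights 1/M.
    steering = np.exp(-2j * np.pi * f_hz * np.outer(np.sin(theta), mics) / c)
    return 20 * np.log10(np.abs(steering.sum(axis=1)) / M + 1e-12)

for f in (500, 1200, 3000):
    bp = das_pattern(f)
    # span of angles where the response is within 3 dB of the broadside maximum
    width = np.degrees(np.ptp(theta[bp > -3.0]))
    print(f, "Hz: approx -3 dB beamwidth", round(width), "degrees")
# Low frequencies give a very wide main lobe, i.e. poor spatial selectivity.
```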
The way to overcome this is often seen as a different way of beamforming, but essentially it's not a different way -- it's just providing different filters to the individual microphone channels -- and then you talk about super-directive beamforming. These are still signal-independent filters, and for the same geometry you see on this heat plot that you can get a much better resolution at low frequencies while still preserving the constant beamwidth around zero degrees here. This example is actually due to the frequency-invariant beamformer design proposed by Power, and the problem here is that in the low-frequency region you get very sensitive to incoherent noise and to calibration errors of the microphones.
That's actually not on the slides here, but we recently developed a method which incorporates constraints on the sensitivity directly into the beamformer design, so that for a given sensitivity you can design the beamformer in one shot.
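A compact sketch of the super-directive design and the sensitivity trade-off (my own illustration: MVDR-style weights against a diffuse-noise coherence matrix, with diagonal loading as a simple stand-in for a white-noise-gain / sensitivity constraint; this is not the one-shot design method mentioned above):

```python
import numpy as np

c, M, d, f = 343.0, 8, 0.04, 500.0
pos = np.arange(M) * d
k = 2 * np.pi * f / c
# Diffuse (spherically isotropic) noise coherence: sin(k*r)/(k*r) between sensors.
dist = np.abs(pos[:, None] - pos[None, :])
Gamma = np.sinc(k * dist / np.pi)              # np.sinc(x) = sin(pi x)/(pi x)
a = np.exp(-1j * k * pos * np.sin(0.0))        # steering vector, broadside look

def sd_weights(mu):
    G = Gamma + mu * np.eye(M)                 # diagonal loading: larger mu -> more robust
    w = np.linalg.solve(G, a)
    return w / (a.conj() @ w)                  # distortionless in the look direction

for mu in (0.0, 1e-2, 1e-1):
    w = sd_weights(mu)
    wng = 10 * np.log10(np.abs(a.conj() @ w) ** 2 / np.real(w.conj() @ w))
    print(f"mu={mu}: white noise gain {wng:.1f} dB")
# mu=0 gives maximum directivity but a very negative white noise gain, i.e. high
# sensitivity to sensor self-noise and mismatch; loading trades directivity for robustness.
```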
The third kind of beamforming I would like to put forward here is actually a robust version of the generalized sidelobe canceler, which we used for speech recognition as well. Actually, it was not mentioned on your slide, so it would be nice to compare the recognition results with yours. As John already mentioned, what you do here -- and you may remember this block diagram -- is constrained beamforming with a mean-square criterion, so it's a second-order-statistics-based criterion in its original version, but you can easily use other optimization criteria, with the problem of then estimating the higher-order statistics. The basic idea is this: you minimize the output power under the constraint that the desired source should not be distorted, and the clever idea of the generalized sidelobe canceler is that the two problems are decomposed, so you have the constraint realized up here and an unconstrained adaptation down here, which makes it really efficient. What's important in the acoustic domain is that you adapt the blocking matrix as well to get good quality. The blocking matrix namely has to provide an output signal which is a good estimate of the noise, and if the desired source moves, it's important that this blocking matrix does not allow any leakage of the desired signal, because the interference canceler, which is placed here, would then cancel out the desired signal as well.
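A schematic sketch of the GSC signal flow (my own minimal version with a fixed delay-and-sum upper path, a fixed pairwise-difference blocking matrix, and an NLMS interference canceler; the robust version discussed in the talk additionally adapts the blocking matrix, which is omitted here):

```python
import numpy as np

def gsc_frame(X, w_ic, mu=0.1, eps=1e-6):
    """One frame of a toy GSC. X: (M, L) time-aligned mic signals (already
    steered toward the desired source). Returns the output block and the
    updated interference-canceler weights w_ic (shape (M-1,))."""
    M, L = X.shape
    d = X.mean(axis=0)             # fixed beamformer: delay-and-sum
    B = X[:-1] - X[1:]             # fixed blocking matrix: adjacent differences
                                   # (removes the time-aligned desired signal)
    y = np.empty(L)
    for n in range(L):
        ref = B[:, n]                                      # noise references
        y[n] = d[n] - w_ic @ ref                           # subtract interference estimate
        w_ic = w_ic + mu * y[n] * ref / (ref @ ref + eps)  # NLMS update
    return y, w_ic

# Toy usage: desired signal identical on all mics, interferer with per-mic gains.
rng = np.random.default_rng(5)
M, L = 4, 20000
s, v = rng.standard_normal(L), rng.standard_normal(L)
X = np.tile(s, (M, 1)) + np.outer(rng.standard_normal(M), v)
y, _ = gsc_frame(X, np.zeros(M - 1))

def resid_db(x, ref):              # power left after projecting out the desired signal
    a = (x @ ref) / (ref @ ref)
    return 10 * np.log10(np.mean((x - a * ref) ** 2))

h = L // 2                         # evaluate after initial convergence
print(round(resid_db(X[0, h:], s[h:]), 1), "dB at a mic vs",
      round(resid_db(y[h:], s[h:]), 1), "dB at the GSC output")
```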
This is actually quite a problem which you need to avoid, so this is why, as I write here, the adaptive blocking matrix is actually quite important. In order to make this converge quite rapidly even in double-talk situations, where both the interferer and the desired source are active, it is desirable to use an adaptation in the DFT domain again, and I will play you an audio sample for this. So first you hear a single microphone signal with two sources; one is the interferer, and it's at an SIR of 3 dB here. And it's actually the [inaudible].
[tape played].
>> Walter Kellermann: So let's listen to the output of the fixed beamformer. And please note that the low frequencies usually cannot be suppressed well, because this is the fixed beamformer like you've seen it a couple of slides ago -- this one, right -- and it does not suppress low-frequency content well. Okay.
[tape played].
>> Walter Kellermann: And now, finally, the output including the effect of the interference canceler, which should reduce the low-frequency interfering noise and the interfering speech as well.
[tape played].
>> Walter Kellermann: Okay. And if you measure this, you get an SIR gain of about 17 dB here. But as you could see already, there is still a challenge here, especially if you go to highly reverberant environments. As John pointed out already, this version of beamforming does not cope completely successfully with reverberation if it does not have enough degrees of freedom by way of many microphones.
So let me move over to the blind part of this talk, where I will focus only on the TRINICON concept which is being developed in my group. First of all I'd like to look at source separation, then briefly at de-reverberation, but also at localization, especially localization of multiple sources and even in multiple dimensions. We'll see in a minute what this means.
So first of all, the blind source separation scenario is given here: you see multiple sources and multiple microphones. We assume that the number of sources and the number of microphones is equal, and we are looking for a demixing matrix here, G_ZX; that's actually the same building block that we had before, but now it's meant for demixing. What you essentially want is to extract the desired sources here, so if you achieved that ideally, it would mean you had inverted this acoustic mixing matrix, and then you would get simply a delayed version of the original signals.
This is not what blind source separation actually does: blind source separation does not do blind deconvolution, rather it only tries to minimize the mutual information between the outputs in the best case. So the strongest criterion that you can fulfill for blind source separation is that the mutual information between the outputs is minimized. TRINICON provides a framework which essentially describes this criterion, the minimum mutual information criterion, and it is especially designed for the non-whiteness, non-stationarity, and non-Gaussianity of the source signals.
And if you look at the cost function for BSS here -- actually TRINICON is more generic and can be used for other purposes like localization and de-reverberation as well -- but if you look at the cost function for BSS, you try to minimize the mutual information between these outputs by looking at multivariate densities of each of these outputs. So here, for example, you have the log of estimated D-dimensional densities for each output: just imagine you take D samples here and you look at the multivariate density of these D output samples of one channel, then you multiply all of these multivariate output densities -- that represents the independence of the individual outputs -- and you contrast this with the joint density over all the output values over all the channels. This is basically the idea.
Of course, if you write this down as a cost function, you don't yet have an algorithm for how to find the best filter coefficients. But this is basically the underlying idea for BSS. And there is some averaging here which allows you to cope with non-stationarity, and also here you have a recursive averaging. So -- yes?
>>: [inaudible]. It seems like you made the assumption that you only have two sources, but it seems that this approach would be super sensitive to noise; if you [inaudible] a little bit of background noise, it would mess this up.
>> Walter Kellermann: I did not mention that estimating these probability densities is of course very difficult in general. We all know that the higher the order of the moment you estimate, the more sensitive it is to noise. Just think of estimating the kurtosis: it's a fourth-order moment, but the variance of that estimate already involves the eighth-order moment, right? So the sensitivity explodes. You can use parametric models, of course, but this is still the generic idea, and whatever models you can put into that will help -- we will come to this. This basically is the starting point, and everything is basically derived from it by specialization. The first and simplest case that you would look at is second-order statistics, which essentially means you assume Gaussian densities. You may know that these do not represent speech signals well, but nevertheless it's a way to start.
And then you know, for example, that these high-dimensional densities can be captured and entirely described by correlation matrices, and that helps. Okay. So if the source model is just a multivariate Gaussian, then you simply have the correlation matrix to consider, and that also leads to a relatively simple adaptation algorithm; the natural gradient, for example, can then be computed from the cost function. So if you reduce the cost function to second-order statistics, you can compute the natural gradient and you have a relatively simple update rule. And if you look at it here, it actually means you have to invert a matrix, very similar to RLS, to recursive least squares. To illustrate what we do here, you see these correlation matrices for the two-channel case. Actually, whatever we said above holds for the P-channel case, but this is for the two-channel case.
So if you don't do anything and you just look at the observed mixtures at the microphones, you would see autocorrelation matrices like this for each channel, and you would see cross-correlation matrices like these off-diagonal matrices. These are of the dimension that you foresee for your demixing filters, so if you foresee demixing filters of length 1,000, then you have 2,000 by 2,000 matrices here.
What this algorithm tries to achieve is to remove the cross-correlation between the channels, which actually means nullifying these off-diagonal matrices. If you want to de-reverberate the signals a little, then you would also remove the outer correlation part of these correlation matrices here, assuming that the inner correlation is due to the speech model and the outer part is due to the room impulse responses. This is what we call partial deconvolution -- multi-channel blind partial deconvolution here. And if you want to completely whiten your output signals, then you would perform a complete deconvolution -- that's multi-channel blind deconvolution -- and then this matrix would be diagonal again.
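A toy illustration of the second-order-statistics idea (my own simplified, instantaneous-mixture stand-in, not the convolutive block-based scheme of the talk: the demixing matrix is found by jointly diagonalizing two correlation matrices from different signal blocks, which exploits the non-stationarity of the sources and nullifies the cross-correlation between the outputs):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 40000
# Two independent sources whose powers change between the two halves (non-stationary).
s = np.vstack([np.r_[0.3 * rng.standard_normal(N // 2), 1.5 * rng.standard_normal(N // 2)],
               np.r_[1.2 * rng.standard_normal(N // 2), 0.4 * rng.standard_normal(N // 2)]])
A = np.array([[1.0, 0.6], [0.5, 1.0]])        # unknown mixing matrix
x = A @ s                                     # observed microphone mixtures

R1 = np.cov(x[:, : N // 2])                   # correlation matrix, first block
R2 = np.cov(x[:, N // 2 :])                   # correlation matrix, second block
_, V = np.linalg.eig(np.linalg.solve(R2, R1)) # generalized eigenvectors diagonalize both
W = np.real(V).T                              # demixing matrix estimate
y = W @ x
# Off-diagonal cross-correlation of the outputs (nearly) vanishes:
print(np.round(np.corrcoef(y), 3))
```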
There has been a real-time implementation since 2004, and we have continuously worked on improving it. It's for P equals 2, two channels, on a PC, and Robert Aichner -- I saw him a while ago, he's gone in the meantime -- is the one who supervised the master's student who implemented it in real time. There is still a lot to be done, and it's no secret anymore that we work on this for hearing aids, where there are heavy constraints; but even without the heavy constraints of that application, reverberation is still a problem, and especially the case where you have more sources than sensors. Think of a cocktail party situation, where you actually have only two hearing aids at two ears, which may be linked binaurally, but you are still not able to deal with 10 sources. So there is still a lot to be done, but there is some progress under way and we will publish on that very soon.
>>: Out of curiosity, have you ever tried to -- have you ever tried to [inaudible] that you
would have, you have a Gaussian circular process [inaudible].
>> Walter Kellermann: Yeah. You will see this in Herbert Buchner's Ph.D. thesis.
>>: [inaudible].
>> Walter Kellermann: Finally I hope. [laughter].
Okay. Anyway, I would like to give you an impression of what we can achieve, for the three-channel case actually. As I said, TRINICON is a relatively generic scheme, and this is actually only one branch, which already makes the assumption that we can describe all the sources by [inaudible], which is actually what you mentioned, by the [inaudible] chi function. So this is a way of creating multivariate densities based on univariate densities and extending that relatively simply. Then you see you have one branch for the multivariate Gaussians and another branch where you have univariate PDFs, and these are basically the branches where you see most of the common BSS techniques aligned. So this is actually a more generic scheme which has these as special cases. And Robert Aichner actually developed a relatively efficient version for multivariate Gaussians, which I want to play here. You will hear the sensor signals first.
[tape played].
>> Walter Kellermann: And now let's listen to the output of the BSS scheme.
[tape played].
>> Walter Kellermann: I guess you could hear the initial convergence phase a bit. Let
me play the other one.
[tape played].
>> Walter Kellermann: The second time I don't play the female voice, but anyway, what you get if you measure the performance is about 19 dB of separation gain, and one of the important features here, which I like to emphasize, is that you basically have no distortion of the desired source. There is a slight coloration, but there is no distortion like you usually have with noise suppression techniques and spectral subtraction.
Okay. So let me move over to de-reverberation. De-reverberation is here depicted as a single-channel problem; we actually look at it in a multi-channel setting, but the main idea here is really that you consider the speech signal as a correlated signal where the first part of the correlation is due to the vocal tract and the rest is due to the room characteristics, which means you would like to preserve these inner diagonals in the correlation matrix and get rid of the outer diagonals. That's a relatively simple idea, but it's relatively hard to put into an efficient adaptive algorithm, because it asks for another constraint in the adaptation. Nevertheless, you can do it, and please note that, as in the blind source separation context, we use filters of length 1,000 at a sampling rate of 16 kilohertz here; we again have two sources and two channels, and the nice thing about this scheme is that it does separation and de-reverberation at the same time. Actually, the separation is even improved relative to the [inaudible] algorithm, and the de-reverberation is also clearly stronger than with, for example, a delay-and-sum beamformer -- though of course, if you only have two mics, a delay-and-sum beamformer will not give you much.
I'm not sure whether you can hear the effect in this room because the original signal is
not too reverberant either but let's try it.
[tape played].
>> Walter Kellermann: Okay. So basically this is how far you get without any extra measures, without any subtraction, post filtering, or equalization of the spectrum. This is really what comes out of the TRINICON algorithm. And of course, if you feel that there is any coloration, you can still equalize that.
Let me look at localization briefly -- I see I'm almost running out of time, but we're getting close. Okay. We can use this blind scheme for localization of multiple sources as well. The idea is again relatively simple, because if you consider this source here and the effort that is made by the demixing matrix in BSS, it means that if you want to separate these two sources from each other and get S1 here, you need to identify filters in these two paths which are essentially equal to these two channels: this acoustic channel should be modeled by this filter, and this one should be modeled by that filter.
So if you then look at these filters and at their main peaks, you will find the direct acoustic path. The peaks of the acoustic paths tell you the relative delay between the two signals, and that is an indication of the direction of arrival, from which you can basically find the source location. That is simple as long as you have only two sources, and it can actually be seen as a generalization of the adaptive eigenvalue decomposition algorithm that Benesty published in '99.
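A minimal sketch of the last step only (my own illustration: converting an estimated relative delay between the two filter peaks into a far-field direction of arrival; the BSS-based filter identification itself is not shown, and the mic spacing and sampling rate are arbitrary example values):

```python
import numpy as np

def doa_from_peaks(h1, h2, fs, mic_dist, c=343.0):
    """h1, h2: identified filters toward one source (as in the BSS demixing
    paths described above). The offset between their main peaks gives the
    TDOA, which maps to a far-field direction of arrival."""
    tdoa = (np.argmax(np.abs(h1)) - np.argmax(np.abs(h2))) / fs
    sin_theta = np.clip(c * tdoa / mic_dist, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Toy check: two impulse-like "filters" whose peaks differ by 3 samples.
fs, d = 16000, 0.2
h1 = np.zeros(64); h1[10] = 1.0
h2 = np.zeros(64); h2[13] = 1.0
print(round(doa_from_peaks(h1, h2, fs, d), 1), "degrees")  # about -18.8 degrees
```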
If you have more than two sources, then you can compute something like directivity patterns at the outputs of the BSS scheme, and the zeros that you will find point to the suppressed sources. So you average over several directivity patterns, and you get a pattern of all these suppressed sources, and these suppressed sources tell you the locations. So that's basically a DOA estimator. And that works in the overdetermined case as well, that is, the case where you have more sources than microphones.
You can also generalize this to multiple dimensions, because usually localization is only seen as a problem in a plane, but of course sometimes you want to be three-dimensional, and then you simply combine several BSS units, maybe one in the vertical plane and one in the horizontal plane. The nice thing about BSS is that at these outputs you have the original, separated source signals. So it's relatively easy to correlate these outputs and, for example, resolve all kinds of ambiguities which you would otherwise get: if you have a set of angles for your desired sources in one plane and another set of angles in the vertical plane, you would not know how to align them, but if you take these signals and correlate them across the two planes, then you find the proper pairs of angles. Nevertheless, there is still a lot to be done, because we are also interested in making this scheme react very fast to short acoustic events -- for example, shots or glasses falling down -- and then you usually don't have enough time to converge to a reliable solution.
>>: Is there some kind of [inaudible].
>> Walter Kellermann: Sorry? Some kind of trade-off?
>>: [inaudible] because if you're doing [inaudible] localization rather than separation [inaudible] filters, then your separation isn't as good.
>> Walter Kellermann: Right. That's basically what we do anyway. I don't talk about the implementation for certain applications, but you would have, at least for localization, as you say, shorter filters, and then you are of course converging faster, but the result is not as reliable as you would like to have it. So of course that's the way to go, but it still does not solve the problem completely, because the reliability suffers. Okay?
So let me finally go to localization of multiple sources in another domain, namely the wave domain, and that brings me to the nice little array that Heinz Teutsch built a while ago, where we basically combine our idea of wave field analysis -- that means describing the acoustic wave field in terms of cylindrical harmonics -- with techniques that are known from array processing, namely the subspace methods. Usually these subspace methods are meant for, and most effective for, narrowband signals. But of course we deal with wideband signals, and fortunately the wave field decomposition gives us a description of the wave field which can be handled as if it were narrowband, so we can apply these methods here. We use again cylindrical harmonics, and this is very close to what Gary Elko proposes as eigenbeams when he uses these arrays for beamforming, but we use it for source localization, and then we use the typical algorithms for subspace localization, that is, MUSIC or ESPRIT -- actually ESPRIT turned out to be much better for our case. This essentially allows us to localize M minus one sources simultaneously, where M is the number of cylindrical harmonics that we consider, and actually we could localize up to five sources with this arrangement.
And this is a very small device, as I said, about 10 centimeters in diameter. Again, this is not complete, because these subspace methods also assume stationary sources, and you have to be quite reliable in estimating your covariance matrices; if you can't do that, then you have to find some improvement of the algorithms to deal with non-stationarity. We are on this road, but we are not yet at the goal, so it's not yet completely solved. And finally, again, reverberant environments are another problem which is not completely solved here. It works in mildly reverberant rooms but not in strongly reverberant rooms. If you put it in here, it would probably work -- this is not a bad environment for that purpose, especially if you put the microphone array at the ceiling -- but if you go to nastier environments like train stations or large halls, then this does not work that well. So far.
So let me conclude. What I wanted to show here is that the plain acoustic human-machine interface still offers a number of major challenges, which can be seen both as signal separation and as system identification tasks. And the reason why this is still a problem? The acoustic environment we deal with is highly complex in terms of the number of degrees of freedom and its time variance, and also, in the natural environment for these natural interfaces, we usually miss the reference signal. So we have many blind problems, and this is one of the main points. And here is the list of what we have been looking at so far: multi-channel acoustic echo cancelation, beamforming, source separation, de-reverberation, and localization.
This basically concludes my talk, but I would like to add to John's statement: reverberant environments really are still one of the many challenges we have, and that's the one we want to solve in the future -- we see it as the next major task. We have been working on that for a while now, but we are still pretty far away from perfect solutions. That concludes my talk. Thank you very much for your attention.
[applause].
>> Ivan Tashev: We have time for a couple of questions.
>>: [inaudible] obviously it's going to go to [inaudible] I just wonder whether this kind of
signal [inaudible] are they going to go to [inaudible] or is there something going on
already?
>> Walter Kellermann: Maybe I didn't get your question.
>>: [inaudible].
>> Walter Kellermann: You mean which kind of processing is going into hearing aids?
>>: Yes. [inaudible].
>> Walter Kellermann: I don't -- I cannot talk too much about that. [laughter].
>> Walter Kellermann: But certainly hearing aids are a very interesting area, and it is highly constrained and does not allow for any imperfections in speech technology. So hearing aid manufacturers will not allow, say, strong noise suppression with artifacts -- that is something you will not be getting there.
>>: [inaudible] the cell phone.
>> Walter Kellermann: The cell phone allows much more, say, artifacts, and whatever is cheap will be bought, right? So it's a bit different. But nevertheless, we are not only focusing on hearing aids, because all the wave field synthesis stuff is obviously not suited for hearing aids. But blind source separation is obviously something which is very well suited to hearing aids, because the blindness does not only mean you are blind to the source positions, you are also blind to the sensor positions. And this is actually a feature that is important for hearing aids, because with hearing aids you will not type in the distance between the microphones first, or, say, the look direction of the small arrays they have behind each ear, right? So these binaural algorithms are actually blind with respect to the sensor positions as well.
>>: So in that case [inaudible] the distance within one single ear with different
microphone, the [inaudible] too small.
>> Walter Kellermann: It's very small. There are a couple of issues. I referred only to two microphones in the blind source separation problem, but of course nowadays hearing aids have up to three mics at each ear, right? And they are very closely spaced; they are highly [inaudible] beamformers, right? So there are a couple of issues in the interaction of those beamformers with blind source separation, for example. But I would say that's a very specific area.
>> Ivan Tashev: More questions? John.
>>: Okay. One comment and one question. So you brought out the Hoshuyama beamformer -- this beamformer developed by your student [inaudible].
>> Walter Kellermann: Yeah, if it were the same, we wouldn't have published it.
>>: No, no, I didn't say it was the same. But we tried it out, and we were sort of embarrassed because we couldn't publish the results, because they weren't better than the MMSE beamformer. [inaudible].
>> Walter Kellermann: That depends on how you implement it, and MMSE is a flexible criterion. If you put the constraint into it, then it's different.
>>: [inaudible] too.
>> Walter Kellermann: MMSE is a different criterion. So if you take a plain MMSE beamformer, you don't have the constraint in it, right?
>>: Right. Okay. The question was: you made the comment that de-reverberation is different from speech separation, right, or source separation. But there's this result from the ICA world, and they show that essentially [inaudible] is related to mutual information.
>> Walter Kellermann: Yeah, that's no doubt about that.
>>: If you minimize the mutual information, you're essentially maximizing the [inaudible].
>> Walter Kellermann: Right but.
>>: You saw from my results that reverberation increases negentropy, or, sorry,
decreases negentropy.
>> Walter Kellermann: Yeah. But this is a byproduct, I would say. It actually -- let's see.
>>: So my question is, doesn't the distinction between de-reverberation and source separation go away if you consider non-Gaussian statistics?
>> Walter Kellermann: No. If you look at this here, this is the BSS cost function, right? It describes mutual independence between the outputs, right? Now, if you want to de-reverberate, then you have to put into these estimated densities a source model for your signal and a model for your reverberation in each channel; then you can describe this as, say, a multivariate density with a separate term for the source model and a separate term for the impulse response, right?
>>: Maybe we should talk about this later on, but there's a result from the ICA world that shows that if you enforce the constraint that the final sources -- the sources that you separate -- must be decorrelated, then maximizing negentropy is equivalent to [inaudible].
>> Walter Kellermann: That's also for scalar mixtures. We are not talking about scalar mixtures yet.
>>: [inaudible] showed demonstrated that when you reverberate a signal it becomes
more Gaussian.
[brief talking over].
>> Walter Kellermann: But this is the case if I reduce my dimension D here to 1, right? That's a special case -- if you go here to 1, then you're fine. But this is explicitly not done here, because we know that our outputs are correlated signals and we treat them in the time domain, right? The cost function is derived in the time domain. The implementation is done in the frequency domain -- we don't care, right? We can do anything in the DFT domain as long as we are careful. But you have to be careful, of course. Nevertheless, the cost function is on the time-domain samples, and then you go to the frequency domain for implementation purposes. And this is a very, very fundamental part of this concept. Unlike all these -- I'm getting nasty now, I'm sorry.
[laughter].
See, here is the unconstrained DFT-domain BSS world, and here is a block which I see as a dark hole, where you do all kinds of repair mechanisms, for example to resolve the ambiguity problem, and this is [inaudible].
>>: Walter, we can [inaudible].
>>: [inaudible].
>> Walter Kellermann: I don't want to discard these algorithms in principle, but they all set D equal to one, and then they need something to realign the frequency bins, right? That can sometimes bring you back to a model which is essentially in line with starting from univariate PDFs and so on. So you can go back a little bit. But many of the algorithms are really using repair mechanisms, correlations between frequency bins, and so on. Right?
>> Ivan Tashev: I'm sorry, we have no time for more questions. [laughter]
The next talk is about to start, and Professor Kellermann is already late for the airport.
Let's thank him.
[applause].