
>> Cha Zhang: Good afternoon. It's my great pleasure to introduce Professor
Ramani Duraiswami and Adam O'Donovan to give a talk on audio cameras for
audio-visual scene analysis.
Professor Ramani Duraiswami is an associate professor in the Department of
Computer Science and the Institute for Advanced Computer Studies at the
University of Maryland, College Park.
He obtained his bachelor of technology degree from IIT Bombay and his PhD from
Johns Hopkins University. He currently directs research at the Perceptual
Interfaces and Reality Laboratory at the University of Maryland. His current
research interests include audio for virtual reality, human-computer
interaction, scientific computing (multicore, GPU and so on), and
computational machine learning and vision.
Adam O'Donovan is a PhD candidate and graduate research assistant in the
Department of Computer Science at the University of Maryland. He has a BS
degree in physics and computer science from Maryland. He received the NVIDIA
fellowship for 2008 and 2009 and the University of Maryland Prime fellowship in
2007 to 2008. He has interned in a couple of places, including Microsoft
Research last year.
So without further ado, let's welcome.
>> Ramani Duraiswami: Thank you, Cha.
[applause].
>> Ramani Duraiswami: So, I'm going to talk about some recent work we've been
conducting, and a device we developed which we call the audio camera. So our
goal -- the goal of our research -- I guess it's a bit loud so I'll move it
down -- the goal of our research is essentially scene understanding, as well as
capturing scenes and reproducing them for remote listening, either
contemporaneously or for later listening.
And of course if you want to understand scenes and sort of -- it's quite often
much more advantageous to use both the visual information and the auditory
information because audio and vision often provide sort of complementary
information.
Sound travels relatively slowly, and it's able to capture lots of information
in its time-varying signal. On the other hand, light is very good for
geometric information because it's essentially pinpoint and so on. But then,
sound does not suffer as much from occlusion as light, so there's lots of
information in both. And often if you use both modalities, you get more bang
for your buck.
So the kind of information people want to capture from sound is of course
speech and non-speech [inaudible], and there is a tremendous amount of work
which has gone on in speech -- speech recognition, automatic speech recognition
and so on. But most of that is with close-talking speech, so speech captured
by microphones right next to you.
But there's other information in sound so for example where the source comes
from. And this includes both the direction and the range of the source. And the
sound information which you receive in a room or outdoors often also captures
information about the ambiance. It has information about the reverberant
structure, the materials of the room, the size of the room and so on. All that
information is available in the sound which you receive. If you have knowledge
of the source location and of the room ambience, that can also help you in
extracting the information and improving speech processing if you have distant
collection. Okay.
So a broad theme of our research is combining microphone arrays and cameras.
And what we like to think is that in our approach, especially the audio
processing part of our work, we differ from many previous authors in the sense
that, especially as far as audio is concerned, when audio and video processing
are done together, usually they're done separately: integration between audio
and video happens after the processing -- once you've completed your job in
both modalities, then you fuse the results.
So especially in this work, what we try to do is treat the audio also as a
geometry sensor and thus as a camera, and try to treat audio and video in a
joint analysis framework.
So this is sort of lots of tall claims made initially. So let's sort of see what we
want to do here. So as I mentioned one of our sort of interests is source
localization. And when we do source localization by audio, you want to use
microphone arrays and there are many approaches to doing source localization,
using microphone arrays. And the oldest techniques are based on solving
geometric non-linear estimation problems. So essentially you know the time
delays of arrival, for example between pairs of microphones, and you try to
solve for the source location by solving some non-linear estimation problem.
But this approach, especially in the presence of noise and reverberations is
known to be inaccurate. Another way which people have used in the literature
is by looking at what's called steered response power. The idea is you use
your microphone array and you hypothesize that there's a source at some given
location: you steer your microphone array to point at that particular location,
and if your hypothesis is correct, you get some gain in the received signal.
Then you repeat this steer-and-look procedure for many, many different
locations, and if you then so choose, you can display this sequence of steered
beams as an image, and you end up with an intensity image of sound arriving
from particular points in space.
And of course this approach is interesting because it's sort of less prone to noise
and potentially more accurate and moreover you can incorporate a priori
constraints.
The disadvantage is that your costs rise as the number of look locations
increases, because essentially you have to sum up all the microphone signals,
multiply them by certain weights, and, if you are doing a frequency dependent
algorithm, potentially do it separately for each frequency, and this gets
extremely expensive.
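To make the steered response power idea concrete, here is a minimal delay-and-sum sketch in Python (not from the talk; the array geometry, signals, and grid of look directions are placeholders you would supply). Note how the cost grows with the number of look directions, since each one needs its own weighted sum over all microphones and frequency bins:

```python
import numpy as np

def srp_map(signals, mic_pos, look_dirs, fs, c=343.0):
    """Delay-and-sum steered response power over a grid of look directions.

    signals:   (num_mics, num_samples) time-domain microphone signals
    mic_pos:   (num_mics, 3) microphone positions in meters
    look_dirs: (num_dirs, 3) unit vectors of hypothesized far-field directions
    Returns one summed-output energy per look direction (an "intensity image").
    """
    num_mics, num_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)                    # per-microphone spectra
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)          # Hz for each bin
    power = np.empty(len(look_dirs))
    for i, d in enumerate(look_dirs):                         # cost grows with look directions
        delays = mic_pos @ d / c                              # far-field plane-wave delays (s)
        steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        beam = (spectra * steering).sum(axis=0)               # phase-align and sum the channels
        power[i] = np.sum(np.abs(beam) ** 2)
    return power
```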
So this is -- I'm still sort of staying in an introductory phase. Now what I'm going
to do is we're going to do this, we're going to follow this approach. We're going
to do beamforming, and we're going to sort of point beams at many, many
directions. But we are going to use a special microphone setup. And this is the
spherical microphone array.
So the spherical microphone array is an interesting object. Essentially you
have a solid spherical surface, and on this surface you can imagine there are a
bunch of microphones; in the ideal case you have a pressure sensitive surface.
And it turns out that for this surface you can use the principle of acoustic
reciprocity and very easily construct beam patterns for any arbitrary look
direction. So suppose you have a plane wave arriving from a particular
direction, theta k and phi k. You can solve the equation for sound scattering
off the surface of the sphere and get the solution for the scattered sound
field, which is given here. And this gives you, for a given plane wave, what
the sound received at any point on the surface of the sphere would be -- that
is, the sound which would be recorded by a microphone placed flush on the
surface of the array.
And in acoustics, there's this wonderful principle -- it also holds for light,
where it is called Helmholtz reciprocity. Just knowing this solution, you can
also automatically find what the beamformer weights would be to do the
beamforming in this direction, theta k and phi k. So essentially you can find
the weights by which you need to multiply the recorded signals to get the
response in a particular direction theta k, phi k.
So the nice thing about this structure is that the beamformer weights can be
factored in a way that you can get essentially a beam pattern which looks like
spherical harmonics. So what are spherical harmonics? Spherical harmonics are
just like Fourier series on the surface of a sphere. In two dimensions, just
as regular Fourier series are a basis on the circle, spherical harmonics are
doubly periodic functions which form a basis on the surface of the sphere, and
essentially any square integrable function on the surface of a sphere can be
expanded in a series in terms of spherical harmonics. For Fourier series you
have one frequency parameter; for spherical harmonics, you have two frequency
parameters. So in one direction you increase the frequency this way,
essentially north-south along the latitude, and as you increase the second
index, you have, along the longitude, the order of the series increasing.
Okay? And any function
can be expanded in terms of spherical harmonics. And because for the spherical
array you now know the weights corresponding to a spherical harmonic beam
pattern in a particular direction, you can essentially compute automatically
the weights for any particular shape you'd like for the beam.
And this is sort of a plug for our book which I throw a couple of times in every
talk. So I have to push the sales up. Okay. So now let's sort of step back and
see how you do spherical array beamforming. So suppose you want to find the
beam response of this array in some particular direction theta, and you have
recorded signals at S microphones which are spread on the surface of the
sphere. You compute these weights, you sum up, and you get the response in
that particular direction. And these weights themselves in general involve an
infinite summation -- that would correspond to the case where you actually have
an infinity of microphones on the surface of the sphere, as opposed to a finite
set of S.
So instead you have to truncate the sum at some truncation order N minus 1,
which is related to the number of microphones S you have. So the number of
coefficients you end up with is proportional to the number of microphones you
have. Suppose you had 64 microphones: you truncate here so that the order runs
from zero through 7 -- so you have 8 orders and 64 coefficients.
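A rough sketch of the truncated spherical-harmonic beamformer described above, using SciPy (this is an assumed textbook-style formulation, not the speakers' code; the rigid-sphere mode strength b_n(ka) follows a standard form, and normalization conventions vary between references):

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn, spherical_yn

def mode_strength(n, ka):
    """Assumed rigid-sphere mode strength b_n(ka): the amplitude of the n-th
    spherical-harmonic mode of a unit plane wave measured on the sphere surface.
    Conventions differ by constant factors across references."""
    h = spherical_jn(n, ka) + 1j * spherical_yn(n, ka)                # h_n(ka)
    hp = spherical_jn(n, ka, derivative=True) + 1j * spherical_yn(n, ka, derivative=True)
    return 4.0 * np.pi * (1j ** n) * (
        spherical_jn(n, ka) - spherical_jn(n, ka, derivative=True) / hp * h)

def spherical_beamform(p_mics, mic_dirs, look_dir, ka, order):
    """Order-limited beamformer output of a spherical array for one frequency.

    p_mics:   (S,) complex microphone pressures at wavenumber k (sphere radius a)
    mic_dirs: (S, 2) microphone (azimuth, colatitude) angles on the sphere
    look_dir: (azimuth, colatitude) of the look direction
    order:    truncation order N-1; needs roughly (order + 1)**2 <= S microphones
    """
    S = len(p_mics)
    out = 0.0 + 0.0j
    for n in range(order + 1):
        bn = mode_strength(n, ka)   # in practice regularized when |b_n| is tiny
        for m in range(-n, n + 1):
            # Quadrature over the S (near-uniform) microphones approximates the
            # continuous integral of the pressure against Y_nm* on the sphere.
            pnm = (4.0 * np.pi / S) * np.sum(
                p_mics * np.conj(sph_harm(m, n, mic_dirs[:, 0], mic_dirs[:, 1])))
            out += (pnm / bn) * sph_harm(m, n, look_dir[0], look_dir[1])
    return out
```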
So we did some improvements to the original work of Meyer and Elko, who created
this spherical array. This is related to some technical issues involving
quadrature on the surface of the sphere. Meyer and Elko used particular designs
for the spherical arrays which required you to place the microphones at the
locations given by particular Platonic solids on the surface of the sphere. And
this was related to how you could perform quadrature on the surface of the
sphere: the microphones had to be at locations where you could perform
quadrature on the sphere.
But it turned out that with their microphone locations, if even one or two
microphones failed, you have problems and you can't do beamforming with these
arrays.
We -- a previous PhD student of [inaudible], Lee, developed general uniform
layouts on the surface of the microphone array. He's also at Microsoft -- I
should have realized; he's in your group, I guess. I should have called him.
But anyway. So he developed this theory of quadrature on the surface of the
sphere using some previous work by Thomson, where he was trying to develop a
theory of electrons on the surface of a sphere and had them all repelling each
other. And it turns out that if you use these as your locations of the
microphones, you get robustness with respect to quadrature.
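A minimal sketch of the repelling-electrons idea for spreading nodes nearly uniformly on a sphere (a simple gradient-style relaxation; the actual layouts and quadrature weights used by the group are not reproduced here, and the step size and iteration count are arbitrary):

```python
import numpy as np

def repel_on_sphere(num_points, steps=2000, step_size=0.01, seed=0):
    """Spread points on the unit sphere by mutual inverse-square repulsion
    (Thomson-problem style), a simple route to near-uniform node layouts."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(num_points, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    for _ in range(steps):
        diff = pts[:, None, :] - pts[None, :, :]                   # pairwise difference vectors
        dist = np.linalg.norm(diff, axis=2) + np.eye(num_points)   # avoid self division by zero
        force = (diff / dist[..., None] ** 3).sum(axis=1)          # sum of repulsive forces
        pts += step_size * force
        pts /= np.linalg.norm(pts, axis=1, keepdims=True)          # project back onto the sphere
    return pts
```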
And so then, a few years ago, Zion [phonetic] built this sphere where he
essentially took a lampshade and placed a bunch of microphones at the locations
corresponding to this quadrature problem, and we then proceeded to use this
array for different things. And here are some pictures of his which show that
even if these four microphones are missing, he still gets good quadrature.
>>: So how could [inaudible].
>> Ramani Duraiswami: Yes. So essentially this arrangement -- the thing is,
this summation corresponds to an integral over the surface of the sphere. And
these weights -- so the locations of these theta s -- if you remember from your
numerical methods, usually when you do quadrature over some interval, if you
choose the locations of your quadrature nodes well -- for example people choose
[inaudible] nodes along the line and they get better results. So just as you
have some special nodes there, these nodes are selected to minimize the
quadrature error.
>>: Those nodes are like [inaudible] current basis?
>> Ramani Duraiswami: Yes. Yes. So they're sort of somehow optimally far
from each other in some sense. Okay. So then we proceeded to build different
arrays and so these are actual experimental beam patterns obtained with these
arrays. So this is an order 5 beam pattern -- this is the main lobe -- and we
also built a hemispherical array for a video conferencing application. We
place the hemisphere on the table, and the sphere is completed by the image of
the hemisphere in the surface of the table, so for free you get double the
number of microphones because of the image principle. And with the same number
of microphones you're able to get much higher order beam patterns.
So this is an 8th order beam pattern which is relatively tight.
>>: Roughly speaking you [inaudible].
>> Ramani Duraiswami: Microphones.
>>: Placing them in the optimal --
>> Ramani Duraiswami: Right.
>>: Versus from the light.
>> Ramani Duraiswami: Right.
>>: How much [inaudible] do you get?
>> Ramani Duraiswami: Oh, so.
>>: Ballpark.
>> Ramani Duraiswami: Ballpark. Okay. So if you have 64 microphones you can
get an eighth order beam pattern. So you can get roughly a square root of the
number of microphones improvement in the gain of the microphone array using the
sphere. Okay?
And we also used them to track traffic as it went along -- point it along, take
it along and so on. But now we get to the subject of the talk. What we wanted
to do with this microphone array was to use it for audio imaging. So we just
return to the first theme which I mentioned.
So suppose now we use this microphone array and there is a source in a room,
say someone speaking. We essentially digitally steer the beam at many angles
and at many [inaudible] -- so at this direction, at this direction, at this
direction, and so on -- and make a map of the energy which is received. So
this is a [inaudible] projection. It's like the sort of earth map which is
laid out flat, so essentially the top part is spread out. Just like Antarctica
and the Arctic are spread out relative to the real world, you have spread out
locations. So this is this person, but you can also see all his reverberations
in that room as he's speaking because of the structure of this image.
So now our goal is to use the spherical microphone array and create a device
which can create this kind of image continuously at frame rates, and then use
it to reason about the structure of audio scenes.
>>: I have a question along --
>> Ramani Duraiswami: Yes?
>>: So what we see besides the main image, almost kind of light blue spots,
what are these? Are those the side lobes or are those actually reflections?
>> Ramani Duraiswami: Those are actual reflections. There will be some side
lobe contamination, also, but the side lobes are relatively weak in this array.
So there is some -- that is like the point spread function of a camera. There
will be some spreading. But for most of these you can actually reason and
figure out that these are actually the reflections --
>>: So you consider that the lower part is a reflection off the floor.
>> Ramani Duraiswami: Ceiling.
>>: And those are from the walls.
>> Ramani Duraiswami: Walls. Right.
>>: Wow.
>> Ramani Duraiswami: And so we essentially transform the spherical array into
a camera for sound. Okay?
>>: So what kind of processing do you have to do?
>> Ramani Duraiswami: I will come to it in one second. So what is the
processing? Essentially -- this is too many words -- essentially we have a ray
from the center of the camera in the look direction, and we have that energy
there, and that is somehow spread by the beam pattern and side lobes, also, but
mainly it is from that particular direction, and we create the image. Okay?
So then we came up with an interesting observation that this image is what's called
a central projection image. So to those of you who have taken a computer vision
course, you know this is sort of the first thing you learn about when you learn
about imaging models that the image model which is used for cameras is that it's
a central projection camera image. So all rays of light pass through this camera
center which is the image center.
And this forms the basis for the geometric analysis of images and you essentially
can use the tools of projective geometry to analyze images. Okay? So now we
-- the images which we are producing with this audio camera also have this
central projection property essentially all the rays are going through the center of
the sphere, which is the imaging sphere.
And because of this, we can essentially -- we have the epipolar geometry. So
suppose now I image a scene using both an audio camera, so which is this guy
here, and a video camera. Then there is an epipolar geometry between the
audio camera and the video camera. So essentially suppose this person P is
being viewed by the video camera -- and let's say P happens to be a face
detected by a face detector which is run on the video camera. Now, when I look
at the audio camera image, that person can be absolutely anywhere, but in fact
they have to lie somewhere along this line in the audio camera space. Okay?
And you can use such constraints between two cameras, between three cameras
and so on, and you can borrow this from vision. And moreover, now if I have two
cameras and I know the correspondence at seven or eight points in the world,
then I have -- I can compute what's called a fundamental matrix between these
two cameras, which then allows us, for any new point, to find the epipolar line
corresponding to it in the other image.
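To make the calibration step concrete, here is a rough linear eight-point sketch for estimating the fundamental matrix between the video and audio cameras and mapping a video point to its epipolar line in the audio image (not the authors' code; in practice one would normalize coordinates and use a robust estimator):

```python
import numpy as np

def fundamental_matrix(x_video, x_audio):
    """Linear eight-point estimate of F with x_audio^T F x_video = 0.

    x_video, x_audio: (N, 2) corresponding pixel coordinates, N >= 8
    """
    A = []
    for (u, v), (up, vp) in zip(x_video, x_audio):
        A.append([up * u, up * v, up, vp * u, vp * v, vp, u, v, 1.0])
    A = np.asarray(A)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                  # null vector of A gives the 9 entries of F
    U, S, Vt = np.linalg.svd(F)               # enforce the rank-2 constraint
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def epipolar_line(F, point_video):
    """Line (a, b, c) with a*u + b*v + c = 0 in the audio image for a video point."""
    u, v = point_video
    return F @ np.array([u, v, 1.0])
```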
So of course when you have cameras you do calibration, so we built a calibration
target. So there's a pencil which has a tiny speaker and a tiny light source at the
end of it, and we simultaneously image it using the audio and video camera.
This is supposed to be an animation, but somehow the GIF is not animating. But
anyway.
And then you can sort of get the epipolar lines, which are shown here between
the two cameras. And so here is the epipolar line from the sound camera,
displayed in the video image. And you can see it sort of passing through the
light source.
>>: [inaudible] light source and --
>> Ramani Duraiswami: And the sound source which are co-located. So now
we want to create these images in realtime, right? So we need to do some
processing to create these images in realtime. So the nice thing is of course this
beamforming is digital, the weights are known explicitly for each direction. I don't
have to -- I can sort of just use the [inaudible]. For each direction the
beamforming is independent. And there is some sort of map -- you can either do
it using time domain signals or frequency domain signals depending on the
application.
But this is sort of relatively expensive. That requires, for each pixel -- so
in light we are very lucky: each pixel essentially gets, almost for free, the
image from a particular ray. But this is as if you had a camera and you had to
sum up all the pixel outputs to get the value for a particular direction; that
is what's happening with the sound camera.
So now we have an expensive computation and -- yes?
>>: Furthermore you have to do it for each frequency?
>> Ramani Duraiswami: We can do it for each frequency, but there's also a time
domain formulation we can do, which can save us something. I didn't go
into that, but there is a way to do it in time domain. But in -- usually we do it
frequency by frequency because the frequency representation gives us more
interesting ways of doing things. I'll come to that in a second.
So we need to somehow speed this up and get it running at frame rate to
produce video images. So of course we can use parallelism and it turns out that
the first setup which people did with this -- trying to do this
beamforming, they were doing too many extra computations which were not
necessary and we could use some special function tricks to reduce the number
of computations.
So it turns out that there was an inner sum which involves a sum of spherical
harmonics. And for this inner sum, it turns out we can use what's called the
spherical harmonic addition theorem and reduce this N special function
evaluations -- it's actually 2N special function evaluations -- and this
summation into just one evaluation of the Legendre polynomial for one angle,
cosine of gamma.
This gamma essentially corresponds to the angle between the microphone location
and the look direction. So if you have a table of look directions, we can just
store these angles corresponding to the specific look directions in a table,
and we don't need to in fact do this special function evaluation all the time.
The Y n m are the spherical harmonic functions which I showed a few slides ago.
So it reduces many multiplies and adds to just one evaluation. And it turns
out that there was another sort of complicated function, which is here, this
guy here. And it turns out that if you go back to the ordinary differential
equation theory which one learns, there is an expression called the Wronskian,
which takes the Bessel functions which correspond to j_n and h_n, and their
product turns out to be a simple expression, so that function evaluation can
also be simplified. And then of course we can use parallel processing.
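A small sketch of the addition-theorem shortcut just described: the inner sum over the spherical-harmonic degree collapses to one Legendre polynomial of cos(gamma), the cosine of the angle between each microphone and the look direction. This is an assumed form for illustration (the mode_strength callable could be, for example, the rigid-sphere version sketched earlier; exact normalization is omitted):

```python
import numpy as np
from scipy.special import eval_legendre

def beam_weights(mic_dirs_xyz, look_dir_xyz, ka, order, mode_strength):
    """Per-microphone weights for one look direction via the spherical-harmonic
    addition theorem: sum_m Y_nm*(look) Y_nm(mic) = (2n + 1)/(4*pi) * P_n(cos gamma).

    mic_dirs_xyz:  (S, 3) unit vectors toward the microphones
    look_dir_xyz:  (3,) unit vector of the look direction
    mode_strength: callable (n, ka) -> b_n(ka), e.g. a rigid-sphere model
    """
    cos_gamma = mic_dirs_xyz @ look_dir_xyz            # one dot product per microphone
    w = np.zeros(len(mic_dirs_xyz), dtype=complex)
    for n in range(order + 1):
        # One Legendre evaluation per order replaces the 2n + 1 spherical harmonics.
        w += ((2 * n + 1) / (4.0 * np.pi)) / mode_strength(n, ka) * eval_legendre(n, cos_gamma)
    return w   # beam output at this frequency is then w @ p_mics (up to a quadrature constant)
```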
So now we've reduced the cost of each direction, and we're going to use the
fact that the directions are independent -- in fact it's trivially parallel --
to get really fast beamforming in each direction. And if you speak of
parallel, you go to sort of graphics processors, and these graphics processors, so
this is now actually a little bit dated. This is couple of years old. This is the
NVIDIA 1800 GTX, and actually this is the one we use. But if you use the later
NVIDIA it's -- the speed is over here which is at a teraflop now and you can -- we
are able to run this thing at 100 frames a second with still some computation to
spare.
Of course Dell and the like didn't used to sell these computers when we wanted
them, so we had to buy computers which had funky lights in them from these
game manufacturers.
>>: [Inaudible] hundred frames per second of what kind of resolution?
>> Ramani Duraiswami: Yes. The resolution is relatively coarse. It's related
to the -- so the resolution is about 10,000 pixels to get 4 pi coverage of the
room.
>>: Does it also depend on the number of frequencies you --
>> Ramani Duraiswami: This was -- we were doing it at about 20 frequency bands
which were covering -- and I skipped -- I glossed over some details. So for
example, at the lower frequency bands you cannot do very high order
beamforming. You can only do order two and order three at a hundred hertz and
so on, but as you get to the kilohertz range, you can do higher order
beamforming, which gives you higher resolution images, and those images which I
showed you corresponded to the four kilohertz band. So that's an important
distinction which I jumped over.
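A sketch of how one video-rate audio frame might be assembled from precomputed per-band weight matrices, with lower orders (broader beams) at the low-frequency bands as just mentioned; the band list, orders, and weight matrices here are placeholders, not the actual system's values:

```python
import numpy as np

def audio_image_frame(spectra, weights_per_band, band_bins):
    """Assemble one audio-camera frame from a block of microphone data.

    spectra:          (S, num_bins) FFT of the current block, one row per microphone
    weights_per_band: list of (num_pixels, S) precomputed weight matrices, one per
                      band (lower beamforming order, hence broader beams, at low bands)
    band_bins:        list of arrays of FFT-bin indices belonging to each band
    Returns a (num_pixels, num_bands) energy image; bands can then be mapped
    to the R, G, B display channels.
    """
    num_pixels = weights_per_band[0].shape[0]
    image = np.zeros((num_pixels, len(band_bins)))
    for b, (W, bins) in enumerate(zip(weights_per_band, band_bins)):
        beams = W @ spectra[:, bins]                   # all pixels x all bins in one matmul
        image[:, b] = np.sum(np.abs(beams) ** 2, axis=1)
    return image
```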
Okay. So now I'm going to show you some images of the sort of realtime thing.
So this is the image of how the calibration was done. So these essentially are
taking [inaudible] and so he's sending off the sound and at different locations so
once that's done you can do some interesting things. So once you have the
calibration done, the next thing you can do is you can do image transfer. So now
with [inaudible] I'm sort of scared to talk about image transfer, but anyway, I'll still
talk about image transfer.
So suppose I have two cameras and I assume that the world is far away, right?
So then just knowing the fundamental matrix, I can do image transfer between
two cameras. So and so I'm going to show you. So here Adam is going to
switch on a speaker; the top is the audio image, and we know the fundamental
matrix between these two cameras. So we're going to transfer the
audio image to the video image.
So we switch it on. And as soon as he switches it on, those pixels light up, which
is where the sound is coming from. And so now we sort of go through -- he's
speaking, he's flicking his thing and so on, and you can sort of find -- and you can
get an idea of the resolution. So one audio pixel essentially becomes much,
much bigger in the regular image.
>>: [inaudible].
>> Ramani Duraiswami: The false colors are actually -- we are showing RGB
images; R, G and B are different frequency bands which are mapped to the
[inaudible] sound color.
>>: So the [inaudible] is what?
>> Ramani Duraiswami: Exact spot in this resolution.
>>: I see.
>> Ramani Duraiswami: That resolution is very low. The whole 10,000 pixels is
basically giving us 4 pi. And this is just a very small field of view, and
you're looking at this --
>>: [inaudible] smooth out there [inaudible].
>> Ramani Duraiswami: Yeah.
>>: Pretty slow.
>> Ramani Duraiswami: It's [inaudible].
>>: We're using a little bit of filtering so the image persists for a little
bit and doesn't just disappear after you snap [inaudible].
>>: [inaudible] speaker.
>> Ramani Duraiswami: So now let me show you some more applications. So in
that last application, we can consider that essentially audio was helping video
to find out where the sound is coming from. So now let's look at an
application where video is helping audio. Yeah?
>>: I missed something. [inaudible] you would only have an [inaudible] the
camera and [inaudible] so you should have like [inaudible].
>> Ramani Duraiswami: So this is the point I was making: if you assume that the
world is far away, that the objects are on a surface far away, then if you know
the two cameras and you assume that for the sources you can get the direction,
you can project onto this direction.
To do it exactly correctly like you are saying, with the range information, we
would need three cameras. Then we can do image transfer between three cameras.
But this assumption actually works --
>>: So this [inaudible] camera reasonably close?
>> Ramani Duraiswami: They're reasonably close, yeah. And so now the -- I'm
going back to the same one. Okay. Okay. Now, in this application what we are
doing is we have Adam and we have the video camera and the audio camera.
And there's a very loud sound source, okay? And this sound source is music, it's
playing very loudly. And we are going to try to beam form, use the epipolar
space to help beamforming. So in this I'm going to play some sound so
essentially the first is sort of the sound without beamforming, and you cannot
make out what Adam is saying and then you'll just hear the sound which is
obtained by searching along the epipolar line for the peak. And in this case,
you'll sort of be able to make out what he's saying. And there were no tricks
like Ivan [phonetic] uses -- post-filtering and all this wonderful stuff. If
you used those, it would be even better. This is pure beamforming.
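A rough sketch of the epipolar-constrained search just described: only the audio-image pixels on the face's epipolar line are beamformed, and the loudest one is kept. The helper names (pixel_to_direction, beamform_dir) are hypothetical stand-ins, not functions from the actual system:

```python
def strongest_on_epipolar_line(p_mics, line_pixels, pixel_to_direction, beamform_dir):
    """Beamform only along the epipolar line of a video-detected source and
    return the output of the loudest look direction.

    line_pixels:        iterable of audio-image pixels lying on the epipolar line
    pixel_to_direction: hypothetical map from an audio pixel to a look direction
    beamform_dir:       hypothetical callable (p_mics, look_dir) -> complex beam output
    """
    best_energy, best_output = -float("inf"), None
    for pix in line_pixels:
        out = beamform_dir(p_mics, pixel_to_direction(pix))
        energy = abs(out) ** 2
        if energy > best_energy:
            best_energy, best_output = energy, out
    return best_output
```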
[music played].
>> Ramani Duraiswami: Now, with beamforming along the epipolar line.
[music played].
>>: [inaudible].
>> Ramani Duraiswami: That's the spherical microphone.
>>: So it's big?
>> Ramani Duraiswami: No, no, no, the one in the back is actually something
else totally.
>>: Oh, okay.
>> Ramani Duraiswami: But it has to do with [inaudible]. This is the [inaudible].
>>: Oh, okay.
>> Ramani Duraiswami: That little white string. That also is a microphone
array, but that measures your head related transfer function. But that's in
the lab.
>>: [inaudible] user source from the [inaudible].
>> Ramani Duraiswami: So both are relatively the same distance to the
microphone. So you can also do things with multiple sources. For example, we
did a lot of experiments with two sources: how identifiable they are. Those
are all reported in a paper which is in the Transactions on Audio Processing --
it is in press -- how close they can be and you can still identify them, how
far apart, and so on. Okay. Yeah.
>>: In that case [inaudible] direction you estimated. You're not trying to
[inaudible].
>> Ramani Duraiswami: No, nothing. So this is very trivial beamforming, which
is just the main lobe. Okay.
So the next idea: in audio, most people know that if you have a room, you get
reflections from all the walls in the room, and that affects listening quality.
Especially in concert hall acoustics, people are very interested in designing
the hall so that you have a good direct path, there are some early arriving
reflections which have to be distinct, and the later reverberation has to be
attenuated and so on. And this is indeed sort of a black art which
architectural acousticians practice.
So could we use the audio camera somehow to help architectural acousticians?
So we went and measured essentially this is a very nice music hall which is at
the campus of the University of Maryland, and this is a panoramic image of this
place. And here we have placed a sound source on the stage, and this is a
spherical panorama which is sort of unwrapped here. And you can see this hall.
So now I'm going to play you a demo of a slow motion movie which we got. So
I'm going to now to display the spherical panorama in this fashion. And so now
this is on a sphere, so it sort of looks more reasonable. And the sound sources
look [inaudible]. Okay? So as I play this sound, it -- you can see the sound
come off. It's a very short chirp. And then you can see the sound reflect off the
walls. You can see the location of the reflections. You can see it's reflected off
the floor, off the back wall, and you can sort of -- this is a short 10 millisecond
chirp, but as you go along you see, for several seconds, essentially all the
reflections in slow motion, and you can see it's now reached the ceiling. You
can see the reflection of sound at all these locations.
Oh, by the way, these colors are normalized so that in each frame the maximum
color is the same: red. So the maximum intensity is red. Otherwise they would
be attenuating and you would see the colors drop off.
So essentially -- we've shown this to architectural acousticians and they are
very, very excited, because apparently it takes them two months to fix a hall
after it's built, and having such tools they can go and get these things. Of
course we need a much more portable audio camera to do that, and we are working
on that, as I'll show you in a second.
>>: [inaudible] after many seconds it's still very coherent?
>> Ramani Duraiswami: Right. It is. And it's amazing to --
>>: [inaudible] would have expected the whole room to eventually be filled with --
>> Ramani Duraiswami: Yeah, it is. If I keep going -- I sort of stopped it,
but if I take it along as time goes on, you'll see. There are so many
reflections all over the place at later times that --
>>: So the whole time sequence is a second --
>> Ramani Duraiswami: It's, yeah -- it goes -- the screen is not -- it goes to
about two-something seconds. But the original chirp was only several
milliseconds [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: Yes. Exactly.
>>: So we're seeing -- this is repeated in a loop or something?
>> Ramani Duraiswami: No, no, it's still going on.
>>: So how [inaudible] oh, I see. Okay.
[brief talking over].
>>: [inaudible] cut off a little bit. That's sort of a plot of what the maximum value
in that particular image is as [inaudible].
>>: So that's [inaudible] you see.
>>: [inaudible].
>>: Well, it's actually difficult to identify what each reflection is.
>> Ramani Duraiswami: Yeah. It's hard to identify but now we're going to sort of
come to another use. So the next goal is of course we did this for concert hall so
can we use it in room acoustics somehow? In room acoustics, we all know
reverberation is often treated as something that is not a friend. It's
bad, it screws up algorithms and so on. There have been people since
the '80s and '90s who have sort of thought of doing what's called matched field
processing.
So the idea there is suppose I know the source and I know the room impulse
response. Then I can potentially invert this room impulse response and clean up
the sound field. Right? But then so the most famous person who sort of worked
a lot on this area is Jim Flanagan, and he essentially wrote several papers trying
to do this idea and even one of his students is also at Microsoft, but he's moved
to Live Labs or Search Labs [inaudible], and they worked on this area of trying to
do basically matched field processing.
But what they found is even a small error in the room impulse response
essentially destroyed the results. Okay? So now the thought we had was is
there some way we can use this audio camera with the visual camera to figure
out which reflections are actually obtained at the microphone array and can we
use those in some way to improve the signal quality as well as do dereverberation.
So this is essentially what I just said. So what I'm going to do is now going to
show you this movie. I guess it didn't launch. Or maybe I'll just go to the
directory.
>>: [inaudible].
>> Ramani Duraiswami: Okay. So let's -- I'll show you the demo at the end when
sort of MATLAB recovers. Let's -- so what is the algorithm? So
essentially what we do is go to a room, we record different people speaking. So
that room in which we did this recording is about a third of this -- maybe a fifth of
this room size. It's a small conference room. And we had two sources speaking
simultaneously. And our goal was to essentially find the reflections which
corresponded to each source, used the spherical array and point the beam for
the direct -- at the location of interest and also find those reverberant reflections
which are visible to the array and which correspond to the source of interest and
clean up the sound while suppressing the other sound and simultaneously
[inaudible]. Okay?
So the first thing we needed was we needed a mechanism to figure out which
reflections correspond to source A and which reflections correspond to the
distracting source. And to do this, we need to build a similarity function
which essentially looks at the beamformed signal coming from the source of
interest and at all the reflections, and sees whether they come from the same
source or from a different source.
So we used MFCC features to do that. And so here are some images from the
audio camera: the images corresponding to one source, the
images corresponding to the second source, and the images corresponding to
the -- when both sources are present. And on the left, we are showing the MFCC
similarity metric which is computed on the beamformer output.
And the right two panels correspond to that similarity metric applied to the image
when we have both sources present. So essentially when we had both sources
present, the beamformed image should have looked something like this. But we
are able to suppress these other reflections and only keep those reflections
which correspond to the female, and similarly for the male we are able to
suppress the female reverberations and get those reverberations which
correspond to the male.
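A minimal sketch of an MFCC-based similarity check between the beam pointed at the source of interest and a beam pointed at a candidate reflection (librosa is used here for the MFCCs; the exact features, frame sizes, and similarity measure the speakers used are not specified in the talk, so this is only illustrative):

```python
import numpy as np
import librosa

def mfcc_similarity(beam_source, beam_candidate, sr, n_mfcc=13):
    """Cosine similarity between time-averaged MFCC vectors of two beamformed
    signals; a high value suggests the candidate reflection belongs to the
    same talker as the direct-path beam."""
    a = librosa.feature.mfcc(y=beam_source, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    b = librosa.feature.mfcc(y=beam_candidate, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```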
>>: [inaudible].
>> Ramani Duraiswami: Okay. So this image is -- this image is the audio
camera image. This is for the male lone -- sorry, for the female alone, for the
male alone, and this is for both of them.
>>: So those other lighter --
>> Ramani Duraiswami: Reflections.
>>: [inaudible] reflections.
>> Ramani Duraiswami: They are reflections. And here, this is the -- now we
run this MFCC similarity guy for each -- for every direction, and it tells us
this location, this location, this location, and this location and this
location are the
locations of the reflections for the female. And we run -- so this was run first on
the case where we had the female alone and now we ran it also on the case
where we had both the female and male speakers and we are still able to sort of
locate the reflections which correspond to the female.
Likewise we did it for the male alone and for the male in the presence of the
female and again we are able to locate the reflections which correspond to the
male speaker. So now, since we know these, the idea of the algorithm is we
beamform not only to the source but also to the locations of the peaks of the
similarity metric, and now we are going to do delay-and-sum beamforming, but on
the beamformed signals from the spherical image.
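A sketch of the combination step just described: delay-and-sum the beam aimed at the talker with the beams aimed at that talker's reflections, using estimated delays (the delays, gains, and signal layout here are placeholders; the talk's dedicated time-delay estimator is not reproduced):

```python
import numpy as np

def combine_direct_and_reflections(beams, delays_samples, gains=None):
    """Delay-and-sum the beamformed direct-path signal with the beamformed
    reflections attributed to the same talker.

    beams:          list of 1-D time-domain beam outputs (direct path first)
    delays_samples: integer delay of each beam relative to the direct path
    gains:          optional per-beam weights (e.g. to de-emphasize weak reflections)
    """
    if gains is None:
        gains = np.ones(len(beams))
    length = min(len(b) - d for b, d in zip(beams, delays_samples))
    out = np.zeros(length)
    for b, d, g in zip(beams, delays_samples, gains):
        out += g * np.asarray(b)[d:d + length]   # advance each reflection by its delay, then add
    return out / np.sum(gains)
```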
>>: So suppose [inaudible] using you know the regular [inaudible] how much
[inaudible].
>> Ramani Duraiswami: I don't know. But this is a very good question and this
is actually work in progress. This is what we did last month. So I'm just --
during the course of this research, we also developed a very good time delay
estimator -- and now we are sort of checking how good it is -- because it works
very nicely in reverberant environments. For performing delay-and-sum
beamforming, we need to figure out the delay between the reflected image
sources, and so here is the single channel signal. This is after doing the
approximate matched filtering process, and this is the original. And we get
significant improvement, at least in perceptual quality, but we need to run
some more tests to give you quantitative scores or something like that to tell
you how much improvement we've had. Okay?
So now let's sort of look at the evolution of the hardware. So if you want to use
this guy and we want to sort of have many people use this guy, we need to give it
out in a portable format and have it go around. So the first one we built I said is
this spherical array, and you can see the bundle of wires coming out from each
of the microphones sort of here.
And of course the hemispherical array was nice because we had a hole in the
table. We could take this huge parade of wires and route it through. But this
makes the whole thing very unportable. So you want to get it portable. So
about 2007, 2008, we tried to develop a portable version, and this is a 32
channel array which we developed, and we moved all the analog-to-digital
conversion electronics into the center of the ball. And so this guy just has
one power and one USB wire coming out of it. And there is some FPGA stuff
going on there which essentially takes all the analog-to-digital output and
packs it into USB and sends it out.
>>: [inaudible] fit into 2 channels.
>> Ramani Duraiswami: In one wire.
>>: In one wire.
>> Ramani Duraiswami: And now --
>>: [inaudible].
>> Ramani Duraiswami: We -- this one is 12 bits, but we could -- we are
actually using a very small percentage of the available bandwidth. So we could
go to 256 microphones at 24 bits if you wanted. And the version we developed
more recently, at the end of last year, is now 64 channels. So this is a five
and a half inch sphere. And it has one camera which is also built into it. So
this is a single USB camera, and this is the camera looking out. We can look
up with this.
And now we have licensed that design to a company in our area, and these guys
are actually now building an array which has six cameras which stitch panoramas
at the same time as they're taking the panoramic audio images. And it's much
more rugged. These guys built enclosures and so on.
And they are trying to ramp up and trying to sell these things to --
>>: [inaudible].
>> Ramani Duraiswami: The radius is about 5.8 inches, which is about
[inaudible].
>>: [inaudible].
>> Ramani Duraiswami: They can do it smaller, too, but this is [inaudible].
Since I started a few minutes early, let me just finish by saying some other
applications of this spherical array which we are working on. One is we also
want to use it for remote reality reproduction. So we want to essentially
capture reality at some point. Say, take you to a concert hall and then have
the ability to place you in that concert hall. So we capture the spherical
array sound, then we do what's called converting the sound field into a plane
wave representation. So we know the plane waves which are impinging at that
location. And now if I know your head related transfer function, I can place
you at that location by taking those plane waves and convolving them with your
head related transfer function. So that's -- and this is the scene recording,
scene playback.
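A sketch of the playback side just described: each plane-wave signal recovered from the spherical array is convolved with the listener's head-related impulse response for that direction and summed per ear (the HRIR lookup and data layout are assumptions for illustration):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(plane_waves, hrirs_left, hrirs_right):
    """Binaural rendering of a plane-wave decomposition of the captured field.

    plane_waves:             (D, T) time signals, one per plane-wave direction
    hrirs_left, hrirs_right: (D, L) individualized head-related impulse responses
                             for the same D directions
    Returns a (2, T + L - 1) stereo signal for headphone playback.
    """
    left = sum(fftconvolve(s, h) for s, h in zip(plane_waves, hrirs_left))
    right = sum(fftconvolve(s, h) for s, h in zip(plane_waves, hrirs_right))
    return np.stack([left, right])
```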
So as far as head related transfer functions are concerned, these are sort of
things, so I'm just going to sort of skip very quickly to this thing. So it turns out
that human beings, when we hear sound, don't hear the actual source sound; we
hear the sound which is actually received at our ears. So if this is the
envelope of the frequencies of the sound from the source, the sound which is
actually received at your ear is quite different, because some frequencies are
enhanced and some are attenuated by the process of scattering off your own
body. And so you are changing the color of the sound. Okay? So when you are
changing the color -- in light this would be like taking a CD and moving it in
light. You'll get lots of colors because the wavelengths of light correspond
to the bumps on the surface of the CD, so you get this color shimmer. Just
like that, when you move in sound, you are changing the color of the sound
which is coming [inaudible].
And the interesting thing is that this process of changing the color of the
sound is different for every person. So every person changes the color of the
received sound differently, and this gives you additional cues to do source
localization. And if I want to reproduce virtual reality and I want to give
you very stable source locations in an audio presentation over headphones, I
need to know this transfer function for every individual.
And measuring this transfer function -- every individual's shape is different,
so the transfer function is different for every person -- used to be a
relatively tedious affair. What you would do is take a speaker and play a
chirp from one location, then play a chirp from another location, another
location and so on. And you'd do this over two hours and maybe a thousand
directions, and that would give you this head related transfer function.
But we use an idea which is often used in vision, also, which is called Helmholtz
reciprocity. This principle says: suppose I have a sound source at a given
location, let's say here, and I have a receiver here. If I swap the locations
of the source and the receiver, I will get the same measurement at the receiver
location. So this is showing you, in simulation, that even though the overall
acoustic field will be totally different between the two points, there will be
this principle of reciprocity.
So we use this idea to measure the head related transfer function in a few
seconds. So the idea is we place headphone drivers turned outward in people's
ears. We use that microphone array which you saw in the background, and we
record the received signal for all directions in one shot, as opposed to doing
direction-by-direction recording, and we can get this head related transfer
function. And if you compare the head related transfer function measured the
two ways, we get the same thing.
So this is related to the scene reproduction work. We also do computation of
head related transfer functions via computation of scattering. So now we are
developing arrays which are not necessarily sphere shaped but, say, head shaped
or other shapes. So suppose I put microphones on some arbitrary scattering
object like a robot -- a network of microphones. How can I beamform with those
microphones? Then I would have to solve the wave equation to get those
beamforming weights, and we're working on those.
But we're going in sort of too many different directions, so I think I'll just
take this opportunity to conclude and thank you all for inviting me here.
[applause].
>>: [inaudible].
>> Ramani Duraiswami: Is it too loud? It's -- actually what we did is we first
sealed [inaudible], okay?
>>: [inaudible].
>> Ramani Duraiswami: Yeah. And then from the speakers. But since this is a
sort of important question, what we did is we also put in a microphone. We
took a dummy head, put a microphone inside the dummy head, sealed it up
[inaudible], and checked what the dB level of the sound is. It was only 65 dB,
which is much lower than -- it's like conversational speech. Otherwise, you
know, it would be wonderful: you can measure everyone's head related transfer
function, but they are deaf [laughter]. But they are deaf at the end of the
procedure, which is --
>>: [inaudible].
>> Ramani Duraiswami: Yes.
>>: [inaudible].
>> Ramani Duraiswami: Almost entirely geometry. A little bit [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: It's stuck.
>>: Still stuck?
>>: So if you [inaudible] geometry of each.
>> Ramani Duraiswami: You can do -- you can compute it. But the computation is
still relatively expensive. First you, and then you.
>>: Okay. So I'm wondering if the spherical -- I mean it's like [inaudible] what if I
[inaudible] 8 by 8 [inaudible].
>> Ramani Duraiswami: You can do the same kind of processing and create
those same kind of images. The only thing is especially with these real
geometries, if you have all these edge effects, then to get reliable beam
weights, to get the same gain in each direction and so on, it's harder to do
with regular arrays. So for example, in such an array the edge microphones
have different weights, those kinds of things, so there are constraints.
But in principle if you could compute easily the beam weights corresponding to
each direction you can use any procedure to create those images. So the
spherical array buys you the sort of nice mathematical framework to get
[inaudible].
>>: I guess [inaudible] so it's like [inaudible].
>> Ramani Duraiswami: It's very similar to the [inaudible].
>>: The [inaudible] transfer. So it seems easier -- I mean it seems easy to
think about the spacing between microphones and the effect that spacing is
going to have on [inaudible].
>> Ramani Duraiswami: That is correct. So for example, if I have more
microphones on my array, I can go to higher order, and I can get [inaudible].
So for example, when we went to the 64 microphones on the hemisphere, we
effectively got about 120 microphones, so we could get order 8 beams out of
that setup, whereas with the regular 64 microphone array we could only get
[inaudible].
Okay. So this is the demo I was trying to show you which sort of died. This is
the -- Adam, can you sort of [inaudible] around the room. So this is the room in
which the experiment for -- this is a -- so what we did is the experiment had two
speakers and we played database frames.
>>: [inaudible].
>> Ramani Duraiswami: It was all cued up. Okay. So here is the room. You
can sort of -- there's a table in the middle. So it has a very complex structure.
So if you wanted to estimate its impulse response using simple geometry and so
on, it would be very hard. So now we play a sound source. So this is one.
Yeah. So that's one speaker playing a sound source. You can sort of see it's
reflecting off the table, off the wall, off the side wall. There's a second
order reflection on that wall off the whiteboard and so on. There are all
these reflections.
So the goal of that idea is somehow to use these reflections to do [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: The room is about this direction 24 feet and widthwise
it's about 11 feet. So this dimension is 11 feet, and this dimension along the
table is about 24 feet.
>>: This is also continuous sound.
>> Ramani Duraiswami: So this is continuous sound because sound is being pumped
into the room. It's not a chirp. It's continuous speech.
>>: So how come when you see it coming and going [inaudible].
>>: Yeah, it [inaudible].
>>: It's continuous noise so -- if you could see that [inaudible] at this
resolution, you could see the [inaudible]. What we're showing in this
particular image is just a high frequency [inaudible], and so what you see in
the energy is that when the energy happens to drop off in this sort of randomly
generated signal, that's when you start to see the reflections. So of course
you can't see that plot at all.
>>: So what you'll see is you'll have short bursts of 5 or 10 milliseconds where
the energy in the 4,000 hertz band is dropped off and that's sort of these points
where you see the reflections come.
>> Ramani Duraiswami: So of course with speech, when people speak there are
sort of interruptions and so on. But at the same time it doesn't decay like in
the other case, because the person keeps repeating speech.
>>: [inaudible] so it's only, you know, five meters across. Five meters across
the [inaudible], so if it's a continuous sound, why aren't we just seeing it
emit from the direct path and then see the --
>> Ramani Duraiswami: Okay. You do see the direct path, so you can see the
direct path go off and come on, go off and come on.
>>: But this is not really [inaudible] this is [inaudible] no.
>> Ramani Duraiswami: No.
>>: This is very slow so --
>>: This is like [inaudible] that's the type of [inaudible], right, right.
>>: So why don't we see the direct path go off.
>>: Because the -- that speech segment got over.
>>: Oh, okay. It's speech segment.
>>: Okay.
>>: It would be a little more clear if they hadn't chopped the graph at the top
of the energy, but unfortunately I guess trying to use this program at low
resolution it [inaudible].
>>: It's easier you [inaudible] short chirp.
>> Ramani Duraiswami: Right.
>>: And a chirp would disappear.
>> Ramani Duraiswami: But the chirp gives you the impulse response. If you
play that chirp, of course you could use that to identify the impulse response.
You could do that. But the goal here was not to do that; the goal here was to
say, suppose we just take this device and we know nothing about that thing, we
don't have the ability to play chirps, we don't have --
>>: Right, right.
>> Ramani Duraiswami: How can we clean up the speech and record?
>>: So the reverberation time, we didn't get into the reverberation there?
>>: It was about 200, 300 milliseconds.
>>: 300 millisecond to [inaudible].
>>: Well, because the signal was sort of continuous, you will always see
[inaudible].
>>: Well, it looked like -- I mean we didn't see a lot of color from other places, so.
>>: Yeah, most of it was the direct -- if you look at sort of the accumulation over
a short period of time, that's when you get the image from the previous --
>> Ramani Duraiswami: So you can sort of see the other side.
>>: Where you get sort of the spots that correspond to actually reflections
become persistent if you sort of average these images and keep a running
average that you can see where all the people [inaudible] are actually.
>> Ramani Duraiswami: Accumulated.
>>: Accumulated. And that's what we use in --
>>: But things aren't just sort of becoming ambient.
>> Ramani Duraiswami: No.
>>: Reverberation though, right.
>> Ramani Duraiswami: No.
>>: I mean presumably if you [inaudible] quick impulse we would see some level
of ambience, but it would happen much quicker than.
>> Ramani Duraiswami: And at the same time, if you played some continuous
noise, then you might see all these sort of standing waves develop and so on.
That also we didn't do yet.
>>: [inaudible].
>> Ramani Duraiswami: Right.
>>: [inaudible].
>> Ramani Duraiswami: Right.
>>: [inaudible] source come from.
>> Ramani Duraiswami: Right.
>>: [inaudible].
>> Ramani Duraiswami: Right.
>>: So there are a lot of people.
>> Ramani Duraiswami: Yes, so we only tried so far with two sources. And we
are able to distinguish the reflections of the two sources.
>>: I guess like you try to [inaudible].
>> Ramani Duraiswami: Yeah.
>>: Over [inaudible] is the [inaudible].
>> Ramani Duraiswami: And so there are many people who research this. So we
would have to build stronger metrics which are not based on just MFCC, maybe
something else, but anyway. This is -- but this is a well studied field. We
have to borrow their ideas. But this is just the first pass.
>>: So I have a question. Microphone array as a camera.
>> Ramani Duraiswami: Right.
>>: I [inaudible] I wonder didn't actually have [inaudible].
>> Ramani Duraiswami: So we are working on some concepts -- the focal length,
of course, is sort of the radius of the array -- but we are working -- your
question is sort of interesting. Can we, for example, build a telephoto lens
for this camera?
>>: Right.
>> Ramani Duraiswami: And we are working on some ideas like that, trying to
see if we can -- so that's work in progress. If it happens we will -- yeah?
>>: You mentioned that a company is going to build a product of this [inaudible].
>> Ramani Duraiswami: Yes.
>>: [inaudible].
>> Ramani Duraiswami: So the -- they're going to build five next month. The
first five units will be built next month. And then they have sort of two or three
potential customers. If they are able to sell to those, they will continue building.
Otherwise.
>>: So the [inaudible].
>> Ramani Duraiswami: Yes. Yes?
>>: So out of that device are you going to get discrete channels or will there
be [inaudible].
>> Ramani Duraiswami: So that device is just going to send you USB -- two
USBs. So one will have the image stream, which is the camera.
>>: Camera.
>> Ramani Duraiswami: Camera. And the other will have the sound which is
sort of packed channel by channel.
>>: Okay. So in the computer then would be the.
>> Ramani Duraiswami: The processing.
>>: To the image?
>> Ramani Duraiswami: Right.
>>: Okay.
>>: Just to make sure on this then: when you are doing the spherical array,
you're just transmitting the transfer function from that direction to that one.
>> Ramani Duraiswami: Yes.
>>: And then you [inaudible].
>> Ramani Duraiswami: Right. So the -- no. Actually not the transfer
function. What we are doing is, for example, because we have the [inaudible]
channel, the way this would be used with an actual speaker is we would run a
face detector, and we know from the face detector what the locations of the
people are.
>>: Suppose you are [inaudible].
>> Ramani Duraiswami: Yes.
>>: [inaudible].
>> Ramani Duraiswami: So then once I know the primary direction of the source,
I take the beam form sound from that direction.
>>: But the question then how do you get [inaudible].
>> Ramani Duraiswami: Using the spherical array and the knowledge of the
source location.
>>: I know. Why are you using the source location [inaudible].
>> Ramani Duraiswami: Yes.
>>: So what did you --
>> Ramani Duraiswami: So now I run my spherical array beamformer.
>>: Once the coefficient [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: Those W weights with [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: They are based on plane wave scattering off the spherical
array from that particular direction.
>>: [inaudible] the particular microphone is going to have a [inaudible].
>> Ramani Duraiswami: Right.
>>: Then you invert that and then [inaudible].
>> Ramani Duraiswami: Right. Precisely. From that direction. I know the
solution for that direction. And --
>>: [inaudible].
>> Ramani Duraiswami: Well, we don't need to calibrate too much. If you
calibrate -- so for example, there are some corrections we could apply. These
weights correspond to far field waves. If you want, you can correct those
weights for near field sources, and there are procedures to do that, but we
don't do that.
>>: I see. So if you are sort of within -- the corrections would have to
happen, for our size of array, if you're closer than a meter to the array. If
you're more than a meter away you can still use the far field weights.
>>: [inaudible].
>> Ramani Duraiswami: Sorry?
>>: [inaudible].
>> Ramani Duraiswami: You would have to correct both the gain and the phase.
So there are -- this question has been studied a lot in the literature for head
related transfer functions, because this is exactly the same thing which is
measured in that context. And it turns out that there are approximate
geometric expressions you can get for how to correct the phase, how to correct
the gain. But if you are beyond a meter -- and our sphere is about head
sized -- if you are beyond that meter, you don't need to correct those. It's
sort of less than [inaudible] dB, whatever [inaudible].
>> Cha Zhang: Okay. Let's thank the speaker again.
[applause]