
>> Cha Zhang: Good afternoon. It's my great pleasure to introduce Professor
Ramani Duraiswami and Adam O'Donovan to give a talk on audio cameras for
audio-visual scene analysis.
Professor Ramani Duraiswami is an associate professor in the Department of
Computer Science and the Institute for Advanced Computer Studies at the
University of Maryland, College Park.
He obtained his bachelor of technology degree from IIT Bombay and his PhD from
Johns Hopkins University. He currently directs research at the Perceptual
Interfaces and Reality Laboratory at the University of Maryland. His current
research interests include audio for virtual reality, human-computer
interaction, scientific computing (multicore, GPU and so on), and
computational machine learning and vision.
Adam O'Donovan is a PhD candidate and graduate research assistant in the
Department of Computer Science at the University of Maryland. He has a BS
degree in physics and computer science from Maryland. He received the NVIDIA
fellowship for 2008 and 2009 and the University of Maryland Prime fellowship in
2007 to 2008. He has interned in a couple of places, including Microsoft
Research last year.
So without further ado, let's welcome.
>> Ramani Duraiswami: Thank you, Cha.
[applause].
>> Ramani Duraiswami: So, I'm going to talk about some recent work we've been
conducting, and a device we developed which we call the audio camera. So our
goal -- the goal of our research -- I guess it's a bit loud so I'll move it
down -- the goal of our research is essentially scene understanding, as well as
capturing scenes and reproducing them for remote listening, either
contemporaneously or for later listening.
And of course if you want to understand scenes and sort of -- it's quite often
much more advantageous to use both the visual information and the auditory
information because audio and vision often provide sort of complementary
information.
Sound travels relatively slowly, and it's able to capture lots of information
in its time-varying signal. On the other hand, light is very good for
geometric information because it's essentially pinpoint and so on. But then,
sound does not suffer as much from occlusion as light, so there's lots of
information in both. And often if you use both modalities, you get more bang
for your buck.
So the kind of information people want to capture from sound is of course
speech and non-speech [inaudible], and there is a tremendous amount of work
which has gone on in speech -- speech recognition, automatic speech recognition
and so on. But most of that is with close-talking speech, so speech captured
by microphones right next to you.
But there's other information in sound so for example where the source comes
from. And this includes both the direction and the range of the source. And the
sound information which you receive in a room or outdoors often also captures
information about the ambiance. It has information about the reverberant
structure, the materials of the room, the size of the room and so on. All that
information is available in the sound which you receive. If you have knowledge
of the source location and of the room ambience, that can also help you in
extracting the information and improving speech processing if you have distant
collection. Okay.
So a broad theme of our research is combining microphone arrays and cameras.
And what we like to think is that in our approach, especially the audio
processing part of our work, we differ from many previous authors in the sense
that, especially as far as audio is concerned, when audio and video processing
are done together, usually they're done separately: integration between audio
and video happens after the processing -- once you've completed your job in
both modalities, then you fuse the results.
So especially in this work, what we try to do is treat the audio also as a
geometry sensor and thus as a camera, and try to treat audio and video in a
joint analysis framework.
So this is sort of lots of tall claims made initially. So let's sort of see what we
want to do here. So as I mentioned one of our sort of interests is source
localization. And when we do source localization by audio, you want to use
microphone arrays and there are many approaches to doing source localization,
using microphone arrays. And the oldest techniques are based on solving
geometric non-linear estimation problems. So essentially you know the time
delays of arrival, for example between pairs of microphones, and you try to
solve for the source location by solving some non-linear estimation problem.
But this approach, especially in the presence of noise and reverberations is
known to be inaccurate. Another way which people have used in the literature
is by looking at what's called steered response power. The idea is you use
your microphone array and you hypothesize that there's a source at some given
location: you steer your microphone array to point at that particular location,
and if your hypothesis is correct, you get some gain in the received signal.
Then you repeat this steer-and-look procedure for many, many different
locations, and if you then so choose, you can display this sequence of steered
beams as an image, and you end up with an intensity image of sound arriving
from particular points in space.
And of course this approach is interesting because it's sort of less prone to noise
and potentially more accurate and moreover you can incorporate a priori
constraints.
The disadvantage is that your costs rise as the number of look locations
increases, because essentially you have to sum up all the microphone signals,
multiply them by certain weights, and, if you are doing a frequency dependent
algorithm, potentially do it separately for each frequency, and this gets
extremely expensive.
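To make the steered response power idea concrete, here is a minimal delay-and-sum sketch in Python (not from the talk; the array geometry, signals, and grid of look directions are placeholders you would supply). Note how the cost grows with the number of look directions, since each one needs its own weighted sum over all microphones and frequency bins:

```python
import numpy as np

def srp_map(signals, mic_pos, look_dirs, fs, c=343.0):
    """Delay-and-sum steered response power over a grid of look directions.

    signals:   (num_mics, num_samples) time-domain microphone signals
    mic_pos:   (num_mics, 3) microphone positions in meters
    look_dirs: (num_dirs, 3) unit vectors of hypothesized far-field directions
    Returns one summed-output energy per look direction (an "intensity image").
    """
    num_mics, num_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)                    # per-microphone spectra
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)          # Hz for each bin
    power = np.empty(len(look_dirs))
    for i, d in enumerate(look_dirs):                         # cost grows with look directions
        delays = mic_pos @ d / c                              # far-field plane-wave delays (s)
        steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        beam = (spectra * steering).sum(axis=0)               # phase-align and sum the channels
        power[i] = np.sum(np.abs(beam) ** 2)
    return power
```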
So this is -- I'm still sort of staying in an introductory phase. Now what I'm going
to do is we're going to do this, we're going to follow this approach. We're going
to do beamforming, and we're going to sort of point beams at many, many
directions. But we are going to use a special microphone setup. And this is the
spherical microphone array.
So the spherical microphone array is an interesting object. Essentially you
have a solid spherical surface, and on this surface you can imagine there are a
bunch of microphones; in the ideal case you have a pressure sensitive surface.
And it turns out that for this surface you can use the principle of acoustic
reciprocity and very easily construct beam patterns for any arbitrary look
direction. So suppose you have a plane wave arriving from a particular
direction, theta k and phi k. You can solve the equation for sound scattering
off the surface of the sphere and get the solution for the scattered sound
field, which is given here. And this gives you, for a given plane wave, what
the sound received at any point on the surface of the sphere would be -- that
is, the sound which would be recorded by a microphone placed flush on the
surface of the array.
And in acoustics, there's this wonderful principle -- it also holds for light,
where it is called Helmholtz reciprocity. Just knowing this solution, you can
also automatically find what the beamformer weights would be to do the
beamforming in this direction, theta k and phi k. So essentially you can find
the weights by which you need to multiply the recorded signals to get the
response in a particular direction theta k, phi k.
So the nice thing about this structure is that the beamformer weights can be
factored in a way that you can get essentially a beam pattern which looks like
spherical harmonics. So what are spherical harmonics? Spherical harmonics are
just like Fourier series on the surface of a sphere. In two dimensions, just
as regular Fourier series are a basis on the circle, spherical harmonics are
doubly periodic functions which form a basis on the surface of the sphere, and
essentially any square integrable function on the surface of a sphere can be
expanded in a series in terms of spherical harmonics. For Fourier series you
have one frequency parameter; for spherical harmonics, you have two frequency
parameters. So in one direction you increase the frequency this way,
essentially north-south along the latitude, and as you increase the second
index, you have, along the longitude, the order of the series increasing.
Okay? And any function
can be expanded in terms of spherical harmonics. And because for the spherical
array you now know the weights corresponding to a spherical harmonic beam
pattern in a particular direction, you can essentially compute automatically
the weights for any particular shape you'd like for the beam.
And this is sort of a plug for our book which I throw a couple of times in every
talk. So I have to push the sales up. Okay. So now let's sort of step back and
see how you do spherical array beamforming. So suppose you want to find the
beam response of this array in some particular direction theta, and you have
recorded signals at S microphones which are spread on the surface of the
sphere. You compute these weights, you sum up, and you get the response in
that particular direction. And these weights themselves in general involve an
infinite summation -- that would correspond to the case where you actually have
an infinity of microphones on the surface of the sphere, as opposed to a finite
set of S.
So instead you have to truncate the sum at some truncation order N minus 1,
which is related to the number of microphones S you have. So the number of
coefficients you end up with is proportional to the number of microphones you
have. Suppose you had 64 microphones: you truncate here so that the order runs
from zero through 7 -- so you have 8 orders and 64 coefficients.
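A rough sketch of the truncated spherical-harmonic beamformer described above, using SciPy (this is an assumed textbook-style formulation, not the speakers' code; the rigid-sphere mode strength b_n(ka) follows a standard form, and normalization conventions vary between references):

```python
import numpy as np
from scipy.special import sph_harm, spherical_jn, spherical_yn

def mode_strength(n, ka):
    """Assumed rigid-sphere mode strength b_n(ka): the amplitude of the n-th
    spherical-harmonic mode of a unit plane wave measured on the sphere surface.
    Conventions differ by constant factors across references."""
    h = spherical_jn(n, ka) + 1j * spherical_yn(n, ka)                # h_n(ka)
    hp = spherical_jn(n, ka, derivative=True) + 1j * spherical_yn(n, ka, derivative=True)
    return 4.0 * np.pi * (1j ** n) * (
        spherical_jn(n, ka) - spherical_jn(n, ka, derivative=True) / hp * h)

def spherical_beamform(p_mics, mic_dirs, look_dir, ka, order):
    """Order-limited beamformer output of a spherical array for one frequency.

    p_mics:   (S,) complex microphone pressures at wavenumber k (sphere radius a)
    mic_dirs: (S, 2) microphone (azimuth, colatitude) angles on the sphere
    look_dir: (azimuth, colatitude) of the look direction
    order:    truncation order N-1; needs roughly (order + 1)**2 <= S microphones
    """
    S = len(p_mics)
    out = 0.0 + 0.0j
    for n in range(order + 1):
        bn = mode_strength(n, ka)   # in practice regularized when |b_n| is tiny
        for m in range(-n, n + 1):
            # Quadrature over the S (near-uniform) microphones approximates the
            # continuous integral of the pressure against Y_nm* on the sphere.
            pnm = (4.0 * np.pi / S) * np.sum(
                p_mics * np.conj(sph_harm(m, n, mic_dirs[:, 0], mic_dirs[:, 1])))
            out += (pnm / bn) * sph_harm(m, n, look_dir[0], look_dir[1])
    return out
```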
So we did some improvements to the original work of Meyer and Elko, who created
this spherical array. This is related to some technical issues involving
quadrature on the surface of the sphere. Meyer and Elko used particular designs
for the spherical arrays which required you to place the microphones at the
locations given by particular Platonic solids on the surface of the sphere. And
this was related to how you could perform quadrature on the surface of the
sphere: the microphones had to be at locations where you could perform
quadrature on the sphere.
But it turned out that with their microphone locations, if even one or two
microphones failed, you have problems and you can't do beamforming with these
arrays.
We -- a previous PhD student of [inaudible], Lee, developed general uniform
layouts on the surface of the microphone array. He's also at Microsoft -- I
should have realized; he's in your group, I guess. I should have called him.
But anyway. So he developed this theory of quadrature on the surface of the
sphere using some previous work by Thomson, where he was trying to develop a
theory of electrons on the surface of a sphere and had them all repelling each
other. And it turns out that if you use these as your locations of the
microphones, you get robustness with respect to quadrature.
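A minimal sketch of the repelling-electrons idea for spreading nodes nearly uniformly on a sphere (a simple gradient-style relaxation; the actual layouts and quadrature weights used by the group are not reproduced here, and the step size and iteration count are arbitrary):

```python
import numpy as np

def repel_on_sphere(num_points, steps=2000, step_size=0.01, seed=0):
    """Spread points on the unit sphere by mutual inverse-square repulsion
    (Thomson-problem style), a simple route to near-uniform node layouts."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(num_points, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    for _ in range(steps):
        diff = pts[:, None, :] - pts[None, :, :]                   # pairwise difference vectors
        dist = np.linalg.norm(diff, axis=2) + np.eye(num_points)   # avoid self division by zero
        force = (diff / dist[..., None] ** 3).sum(axis=1)          # sum of repulsive forces
        pts += step_size * force
        pts /= np.linalg.norm(pts, axis=1, keepdims=True)          # project back onto the sphere
    return pts
```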
And so then, a few years ago, Zion [phonetic] built this sphere where he
essentially took a lampshade and placed a bunch of microphones at the locations
corresponding to this quadrature problem, and we then proceeded to use this
array for different things. And here are some pictures of his which show that
even if these four microphones are missing, he still gets good quadrature.
>>: So how could [inaudible].
>> Ramani Duraiswami: Yes. So essentially this arrangement -- the thing is,
this summation corresponds to an integral over the surface of the sphere. And
these weights -- so the locations of these theta s -- if you remember from your
numerical methods, usually when you do quadrature over some interval, if you
choose the locations of your quadrature nodes well -- for example people choose
[inaudible] nodes along the line and they get better results. So just as you
have some special nodes there, these nodes are selected to minimize the
quadrature error.
>>: Those nodes are like [inaudible] current basis?
>> Ramani Duraiswami: Yes. Yes. So they're sort of somehow optimally far
from each other in some sense. Okay. So then we proceeded to build different
arrays and so these are actual experimental beam patterns obtained with these
arrays. So this is an order 5 beam pattern -- this is the main lobe -- and we
also built a hemispherical array for a video conferencing application. We
place the hemisphere on the table, and the sphere is completed by the image of
the hemisphere in the surface of the table, so for free you get double the
number of microphones because of the image principle. And with the same number
of microphones you're able to get much higher order beam patterns.
So this is an 8th order beam pattern which is relatively tight.
>>: Roughly speaking you [inaudible].
>> Ramani Duraiswami: Microphones.
>>: Placing them in the optimal --
>> Ramani Duraiswami: Right.
>>: Versus from the light.
>> Ramani Duraiswami: Right.
>>: How much [inaudible] do you get?
>> Ramani Duraiswami: Oh, so.
>>: Ballpark.
>> Ramani Duraiswami: Ballpark. Okay. So if you have 64 microphones you can
get an eighth order beam pattern. So you can get roughly a square root of the
number of microphones improvement in the gain of the microphone array using the
sphere. Okay?
And we also used them to track traffic as it went along -- point it along, take
it along and so on. But now we get to the subject of the talk. What we wanted
to do with this microphone array was to use it for audio imaging. So we just
return to the first theme which I mentioned.
So suppose now we use this microphone array and there is a source in a room,
say someone speaking. We essentially digitally steer the beam at many angles
and at many [inaudible] -- so at this direction, at this direction, at this
direction, and so on -- and make a map of the energy which is received. So
this is a [inaudible] projection. It's like the sort of earth map which is
laid out flat, so essentially the top part is spread out. Just like Antarctica
and the Arctic are spread out relative to the real world, you have spread out
locations. So this is this person, but you can also see all his reverberations
in that room as he's speaking because of the structure of this image.
So now our goal is to use the spherical microphone array and create a device
which can create this kind of image continuously at frame rates, and then use
it to reason about the structure of audio scenes.
>>: I have a question along --
>> Ramani Duraiswami: Yes?
>>: So what we see besides the main image, almost kind of light blue spots,
what are these? Are those the side lobes or are those actually reflections?
>> Ramani Duraiswami: Those are actual reflections. There will be some side
lobe contamination, also, but the side lobes are relatively weak in this array.
So there is some -- that is like the point spread function of a camera. There
will be some spreading. But for most of these you can actually reason and
figure out that these are actually the reflections --
>>: So you consider that the lower part is a reflection off the floor.
>> Ramani Duraiswami: Ceiling.
>>: And those are from the walls.
>> Ramani Duraiswami: Walls. Right.
>>: Wow.
>> Ramani Duraiswami: And so we essentially transform the spherical array into
a camera for sound. Okay?
>>: So what kind of processing do you have to do?
>> Ramani Duraiswami: I will come to it in one second. So what is the
processing? Essentially -- this is too many words -- essentially we have a ray
from the center of the camera in the look direction, and we have that energy
there, and that is somehow spread by the beam pattern and side lobes, also, but
mainly it is from that particular direction, and we create the image. Okay?
So then we came up with an interesting observation that this image is what's called
a central projection image. So to those of you who have taken a computer vision
course, you know this is sort of the first thing you learn about when you learn
about imaging models that the image model which is used for cameras is that it's
a central projection camera image. So all rays of light pass through this camera
center which is the image center.
And this forms the basis for the geometric analysis of images and you essentially
can use the tools of projective geometry to analyze images. Okay? So now we
-- the images which we are producing with this audio camera also have this
central projection property essentially all the rays are going through the center of
the sphere, which is the imaging sphere.
And because of this, we can essentially -- we have the epipolar geometry. So
suppose now I image a scene using both an audio camera, so which is this guy
here, and a video camera. Then there is an epipolar geometry between the
audio camera and the video camera. So essentially suppose this person P is
being viewed by the video camera -- and let's say P happens to be a face
detected by a face detector which is run on the video camera. Now, when I look
at the audio camera image, that person can be absolutely anywhere, but in fact
they have to lie somewhere along this line in the audio camera space. Okay?
And you can use such constraints between two cameras, between three cameras
and so on, and you can borrow this from vision. And moreover, now if I have two
cameras and I know the correspondence at seven or eight points in the world,
then I have -- I can compute what's called a fundamental matrix between these
two cameras, which then allows us, for any new point, to find the epipolar line
corresponding to it in the other image.
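To make the calibration step concrete, here is a rough linear eight-point sketch for estimating the fundamental matrix between the video and audio cameras and mapping a video point to its epipolar line in the audio image (not the authors' code; in practice one would normalize coordinates and use a robust estimator):

```python
import numpy as np

def fundamental_matrix(x_video, x_audio):
    """Linear eight-point estimate of F with x_audio^T F x_video = 0.

    x_video, x_audio: (N, 2) corresponding pixel coordinates, N >= 8
    """
    A = []
    for (u, v), (up, vp) in zip(x_video, x_audio):
        A.append([up * u, up * v, up, vp * u, vp * v, vp, u, v, 1.0])
    A = np.asarray(A)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                  # null vector of A gives the 9 entries of F
    U, S, Vt = np.linalg.svd(F)               # enforce the rank-2 constraint
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def epipolar_line(F, point_video):
    """Line (a, b, c) with a*u + b*v + c = 0 in the audio image for a video point."""
    u, v = point_video
    return F @ np.array([u, v, 1.0])
```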
So of course when you have cameras you do calibration, so we built a calibration
target. So there's a pencil which has a tiny speaker and a tiny light source at the
end of it, and we simultaneously image it using the audio and video camera.
This is supposed to be an animation, but somehow the GIF is not animating. But
anyway.
And then you can sort of get the epipolar lines, which are shown here between
the two cameras. And so here is the epipolar line from the sound camera,
displayed in the video image. And you can see it sort of passing through the
light source.
>>: [inaudible] light source and --
>> Ramani Duraiswami: And the sound source which are co-located. So now
we want to create these images in realtime, right? So we need to do some
processing to create these images in realtime. So the nice thing is of course this
beamforming is digital, the weights are known explicitly for each direction. I don't
have to -- I can sort of just use the [inaudible]. For each direction the
beamforming is independent. And there is some sort of map -- you can either do
it using time domain signals or frequency domain signals depending on the
application.
But this is sort of relatively expensive. That requires, for each pixel -- so
in light we are very lucky: each pixel essentially gets, almost for free, the
image from a particular ray. But this is as if you had a camera and you had to
sum up all the pixel outputs to get the value for a particular direction; that
is what's happening with the sound camera.
So now we have an expensive computation and -- yes?
>>: Furthermore you have to do it for each frequency?
>> Ramani Duraiswami: We can do it for each frequency, but there's also a time
domain formulation we can do, which can save us something. I didn't go
into that, but there is a way to do it in time domain. But in -- usually we do it
frequency by frequency because the frequency representation gives us more
interesting ways of doing things. I'll come to that in a second.
So we need to somehow speed this up and get it running at frame rate to
produce video images. So of course we can use parallelism and it turns out that
the first setup which people did with this -- trying to do this
beamforming, they were doing too many extra computations which were not
necessary and we could use some special function tricks to reduce the number
of computations.
So it turns out that there was an inner sum which involves a sum of spherical
harmonics. And for this inner sum, it turns out we can use what's called the
spherical harmonic addition theorem and reduce this N special function
evaluations -- it's actually 2N special function evaluations -- and this
summation into just one evaluation of the Legendre polynomial for one angle,
cosine of gamma.
This gamma essentially corresponds to the angle between the microphone location
and the look direction. So if you have a table of look directions, we can just
store these angles corresponding to the specific look directions in a table,
and we don't need to in fact do this special function evaluation all the time.
The Y n m are the spherical harmonic functions which I showed a few slides ago.
So it reduces many multiplies and adds to just one evaluation. And it turns
out that there was another sort of complicated function, which is here, this
guy here. And it turns out that if you go back to the ordinary differential
equation theory which one learns, there is an expression called the Wronskian,
which takes the Bessel functions which correspond to j_n and h_n, and their
product turns out to be a simple expression, so that function evaluation can
also be simplified. And then of course we can use parallel processing.
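A small sketch of the addition-theorem shortcut just described: the inner sum over the spherical-harmonic degree collapses to one Legendre polynomial of cos(gamma), the cosine of the angle between each microphone and the look direction. This is an assumed form for illustration (the mode_strength callable could be, for example, the rigid-sphere version sketched earlier; exact normalization is omitted):

```python
import numpy as np
from scipy.special import eval_legendre

def beam_weights(mic_dirs_xyz, look_dir_xyz, ka, order, mode_strength):
    """Per-microphone weights for one look direction via the spherical-harmonic
    addition theorem: sum_m Y_nm*(look) Y_nm(mic) = (2n + 1)/(4*pi) * P_n(cos gamma).

    mic_dirs_xyz:  (S, 3) unit vectors toward the microphones
    look_dir_xyz:  (3,) unit vector of the look direction
    mode_strength: callable (n, ka) -> b_n(ka), e.g. a rigid-sphere model
    """
    cos_gamma = mic_dirs_xyz @ look_dir_xyz            # one dot product per microphone
    w = np.zeros(len(mic_dirs_xyz), dtype=complex)
    for n in range(order + 1):
        # One Legendre evaluation per order replaces the 2n + 1 spherical harmonics.
        w += ((2 * n + 1) / (4.0 * np.pi)) / mode_strength(n, ka) * eval_legendre(n, cos_gamma)
    return w   # beam output at this frequency is then w @ p_mics (up to a quadrature constant)
```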
So now we've reduced the cost of each direction, and we're going to use the
fact that the directions are independent -- in fact it's trivially parallel --
to get really fast beamforming in each direction. And if you speak of
parallel, you go to sort of graphics processors, and these graphics processors, so
this is now actually a little bit dated. This is couple of years old. This is the
NVIDIA 1800 GTX, and actually this is the one we use. But if you use the later
NVIDIA it's -- the speed is over here which is at a teraflop now and you can -- we
are able to run this thing at 100 frames a second with still some computation to
spare.
Of course Dell and the like didn't used to sell these computers when we wanted
them, so we had to buy computers which had funky lights in them from these
game manufacturers.
>>: [Inaudible] hundred frames per second of what kind of resolution?
>> Ramani Duraiswami: Yes. The resolution is relatively coarse. It's related
to the -- so the resolution is about 10,000 pixels to get 4 pi coverage of the
room.
>>: Does it also depend on the number of frequencies you --
>> Ramani Duraiswami: This was -- we were doing it at about 20 frequency bands
which were covering -- and I skipped -- I glossed over some details. So for
example, at the lower frequency bands you cannot do very high order
beamforming. You can only do order two and order three at a hundred hertz and
so on, but as you get to the kilohertz range, you can do higher order
beamforming, which gives you higher resolution images, and those images which I
showed you corresponded to the four kilohertz band. So that's an important
distinction which I jumped over.
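A sketch of how one video-rate audio frame might be assembled from precomputed per-band weight matrices, with lower orders (broader beams) at the low-frequency bands as just mentioned; the band list, orders, and weight matrices here are placeholders, not the actual system's values:

```python
import numpy as np

def audio_image_frame(spectra, weights_per_band, band_bins):
    """Assemble one audio-camera frame from a block of microphone data.

    spectra:          (S, num_bins) FFT of the current block, one row per microphone
    weights_per_band: list of (num_pixels, S) precomputed weight matrices, one per
                      band (lower beamforming order, hence broader beams, at low bands)
    band_bins:        list of arrays of FFT-bin indices belonging to each band
    Returns a (num_pixels, num_bands) energy image; bands can then be mapped
    to the R, G, B display channels.
    """
    num_pixels = weights_per_band[0].shape[0]
    image = np.zeros((num_pixels, len(band_bins)))
    for b, (W, bins) in enumerate(zip(weights_per_band, band_bins)):
        beams = W @ spectra[:, bins]                   # all pixels x all bins in one matmul
        image[:, b] = np.sum(np.abs(beams) ** 2, axis=1)
    return image
```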
Okay. So now I'm going to show you some images of the sort of realtime thing.
So this is the image of how the calibration was done. So these essentially are
taking [inaudible] and so he's sending off the sound and at different locations so
once that's done you can do some interesting things. So once you have the
calibration done, the next thing you can do is you can do image transfer. So now
with [inaudible] I'm sort of scared to talk about image transfer, but anyway, I'll still
talk about image transfer.
So suppose I have two cameras and I assume that the world is far away, right?
So then just knowing the fundamental matrix, I can do image transfer between
two cameras. So and so I'm going to show you. So here Adam is going to
switch on a speaker; the top is the audio image, and we know the fundamental
matrix between these two cameras. So we're going to transfer the
audio image to the video image.
So we switch it on. And as soon as he switches it on, those pixels light up, which
is where the sound is coming from. And so now we sort of go through -- he's
speaking, he's flicking his thing and so on, and you can sort of find -- and you can
get an idea of the resolution. So one audio pixel essentially becomes much,
much bigger in the regular image.
>>: [inaudible].
>> Ramani Duraiswami: The false colors are actually -- we are showing RGB
images; R, G and B are different frequency bands which are mapped to the
[inaudible] sound color.
>>: So the [inaudible] is what?
>> Ramani Duraiswami: Exact spot in this resolution.
>>: I see.
>> Ramani Duraiswami: That resolution is very low. The whole 10,000 pixels is
basically giving us 4 pi. And this is just a very small field of view, and
you're looking at this --
>>: [inaudible] smooth out there [inaudible].
>> Ramani Duraiswami: Yeah.
>>: Pretty slow.
>> Ramani Duraiswami: It's [inaudible].
>>: We're using a little bit of filtering so the image persists for a little
bit and doesn't just disappear after you snap [inaudible].
>>: [inaudible] speaker.
>> Ramani Duraiswami: So now let me show you some more applications. So in
that last application, we can consider that essentially audio was helping video
to find out where the sound is coming from. So now let's look at an
application where video is helping audio. Yeah?
>>: I missed something. [inaudible] you would only have an [inaudible] the
camera and [inaudible] so you should have like [inaudible].
>> Ramani Duraiswami: So this is the point I was making: if you assume that the
world is far away, that the objects are on a surface far away, then if you know
the two cameras and you assume that for the sources you can get the direction,
you can project onto this direction.
To do it exactly correctly like you are saying, with the range information, we
would need three cameras. Then we can do image transfer between three cameras.
But this assumption actually works --
>>: So this [inaudible] camera reasonably close?
>> Ramani Duraiswami: They're reasonably close, yeah. And so now the -- I'm
going back to the same one. Okay. Okay. Now, in this application what we are
doing is we have Adam and we have the video camera and the audio camera.
And there's a very loud sound source, okay? And this sound source is music, it's
playing very loudly. And we are going to try to beam form, use the epipolar
space to help beamforming. So in this I'm going to play some sound so
essentially the first is sort of the sound without beamforming, and you cannot
make out what Adam is saying and then you'll just hear the sound which is
obtained by searching along the epipolar line for the peak. And in this case,
you'll sort of be able to make out what he's saying. And there were no tricks
like Ivan [phonetic] uses -- post-filtering and all this wonderful stuff. If
you used those, it would be even better. This is pure beamforming.
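A rough sketch of the epipolar-constrained search just described: only the audio-image pixels on the face's epipolar line are beamformed, and the loudest one is kept. The helper names (pixel_to_direction, beamform_dir) are hypothetical stand-ins, not functions from the actual system:

```python
def strongest_on_epipolar_line(p_mics, line_pixels, pixel_to_direction, beamform_dir):
    """Beamform only along the epipolar line of a video-detected source and
    return the output of the loudest look direction.

    line_pixels:        iterable of audio-image pixels lying on the epipolar line
    pixel_to_direction: hypothetical map from an audio pixel to a look direction
    beamform_dir:       hypothetical callable (p_mics, look_dir) -> complex beam output
    """
    best_energy, best_output = -float("inf"), None
    for pix in line_pixels:
        out = beamform_dir(p_mics, pixel_to_direction(pix))
        energy = abs(out) ** 2
        if energy > best_energy:
            best_energy, best_output = energy, out
    return best_output
```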
[music played].
>> Ramani Duraiswami: Now, with beamforming along the epipolar line.
[music played].
>>: [inaudible].
>> Ramani Duraiswami: That's the spherical microphone.
>>: So it's big?
>> Ramani Duraiswami: No, no, no, the one in the back is actually something
else totally.
>>: Oh, okay.
>> Ramani Duraiswami: But it has to do with [inaudible]. This is the [inaudible].
>>: Oh, okay.
>> Ramani Duraiswami: That little white string. That also is a microphone
array, but that measures your head related transfer function. But that's in
the lab.
>>: [inaudible] user source from the [inaudible].
>> Ramani Duraiswami: So both are relatively the same distance to the
microphone. So you can also do things with multiple sources. For example, we
did a lot of experiments with two sources: how identifiable they are. Those
are all reported in a paper which is in the Transactions on Audio Processing --
it is in press -- how close they can be and you can still identify them, how
far apart, and so on. Okay. Yeah.
>>: In that case [inaudible] direction you estimated. You're not trying to
[inaudible].
>> Ramani Duraiswami: No, nothing. So this is very trivial beamforming, which
is just the main lobe. Okay.
So the next idea: in audio, most people know that if you have a room, you get
reflections from all the walls in the room, and that affects listening quality.
Especially in concert hall acoustics, people are very interested in designing
the hall so that you have a good direct path, there are some early arriving
reflections which have to be distinct, and the later reverberation has to be
attenuated and so on. And this is indeed sort of a black art which
architectural acousticians practice.
So could we use the audio camera somehow to help architectural acousticians?
So we went and measured essentially this is a very nice music hall which is at
the campus of the University of Maryland, and this is a panoramic image of this
place. And here we have placed a sound source on the stage, and this is a
spherical panorama which is sort of unwrapped here. And you can see this hall.
So now I'm going to play you a demo of a slow motion movie which we got. So
I'm going to now to display the spherical panorama in this fashion. And so now
this is on a sphere, so it sort of looks more reasonable. And the sound sources
look [inaudible]. Okay? So as I play this sound, it -- you can see the sound
come off. It's a very short chirp. And then you can see the sound reflect off the
walls. You can see the location of the reflections. You can see it's reflected off
the floor, off the back wall, and you can sort of -- this is a short 10 millisecond
chirp, but as you go along you see, for several seconds, essentially all the
reflections in slow motion, and you can see it's now reached the ceiling. You
can see the reflection of sound at all these locations.
Oh, by the way, these colors are normalized so that in each frame the maximum
color is the same: red. So the maximum intensity is red. Otherwise they would
be attenuating and you would see the colors drop off.
So essentially -- we've shown this to architectural acousticians and they are
very, very excited, because apparently it takes them two months to fix a hall
after it's built, and having such tools they can go and get these things. Of
course we need a much more portable audio camera to do that, and we are working
on that, as I'll show you in a second.
>>: [inaudible] after many seconds it's still very coherent?
>> Ramani Duraiswami: Right. It is. And it's amazing to --
>>: [inaudible] would have expected the whole room to eventually be filled with --
>> Ramani Duraiswami: Yeah, it is. If I keep going -- I sort of stopped it,
but if I take it along as time goes on, you'll see. There are so many
reflections all over the place at later times that --
>>: So the whole time sequence is a second --
>> Ramani Duraiswami: It's, yeah -- it goes -- the screen is not -- it goes to
about two-something seconds. But the original chirp was only several
milliseconds [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: Yes. Exactly.
>>: So we're seeing -- this is repeated in a loop or something?
>> Ramani Duraiswami: No, no, it's still going on.
>>: So how [inaudible] oh, I see. Okay.
[brief talking over].
>>: [inaudible] cut off a little bit. That's sort of a plot of what the maximum value
in that particular image is as [inaudible].
>>: So that's [inaudible] you see.
>>: [inaudible].
>>: Well, it's actually difficult to identify what each reflection is.
>> Ramani Duraiswami: Yeah. It's hard to identify but now we're going to sort of
come to another use. So the next goal is of course we did this for concert hall so
can we use it in room acoustics somehow? In room acoustics, we all know
reverberation is often treated as something that is not a friend. It's
bad, it screws up algorithms and so on. There have been people since
the '80s and '90s who have sort of thought of doing what's called matched field
processing.
So the idea there is suppose I know the source and I know the room impulse
response. Then I can potentially invert this room impulse response and clean up
the sound field. Right? But then so the most famous person who sort of worked
a lot on this area is Jim Flanagan, and he essentially wrote several papers trying
to do this idea and even one of his students is also at Microsoft, but he's moved
to Live Labs or Search Labs [inaudible], and they worked on this area of trying to
do basically matched field processing.
But what they found is even a small error in the room impulse response
essentially destroyed the results. Okay? So now the thought we had was is
there some way we can use this audio camera with the visual camera to figure
out which reflections are actually obtained at the microphone array and can we
use those in some way to improve the signal quality as well as do dereverberation.
So this is essentially what I just said. So what I'm going to do is now going to
show you this movie. I guess it didn't launch. Or maybe I'll just go to the
directory.
>>: [inaudible].
>> Ramani Duraiswami: Okay. So let's -- I'll show you the demo at the end when
sort of MATLAB recovers. Let's -- so what is the algorithm? So
essentially what we do is go to a room, we record different people speaking. So
that room in which we did this recording is about a third of this -- maybe a fifth of
this room size. It's a small conference room. And we had two sources speaking
simultaneously. And our goal was to essentially find the reflections which
corresponded to each source, used the spherical array and point the beam for
the direct -- at the location of interest and also find those reverberant reflections
which are visible to the array and which correspond to the source of interest and
clean up the sound while suppressing the other sound and simultaneously
[inaudible]. Okay?
So the first thing we needed was we needed a mechanism to figure out which
reflections correspond to source A and which reflections correspond to the
distracting source. And to do this, we need to build a similarity function
which essentially looks at the beamformed signal coming from the source of
interest and at all the reflections, and sees whether they come from the same
source or from a different source.
So we used MFCC features to do that. And so here are some images from the
audio camera: the images corresponding to one source, the
images corresponding to the second source, and the images corresponding to
the -- when both sources are present. And on the left, we are showing the MFCC
similarity metric which is computed on the beamformer output.
And the right two panels correspond to that similarity metric applied to the image
when we have both sources present. So essentially when we had both sources
present, the beamformed image should have looked something like this. But we
are able to suppress these other reflections and only keep those reflections
which correspond to the female, and similarly for the male we are able to
suppress the female reverberations and get those reverberations which
correspond to the male.
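A minimal sketch of an MFCC-based similarity check between the beam pointed at the source of interest and a beam pointed at a candidate reflection (librosa is used here for the MFCCs; the exact features, frame sizes, and similarity measure the speakers used are not specified in the talk, so this is only illustrative):

```python
import numpy as np
import librosa

def mfcc_similarity(beam_source, beam_candidate, sr, n_mfcc=13):
    """Cosine similarity between time-averaged MFCC vectors of two beamformed
    signals; a high value suggests the candidate reflection belongs to the
    same talker as the direct-path beam."""
    a = librosa.feature.mfcc(y=beam_source, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    b = librosa.feature.mfcc(y=beam_candidate, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```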
>>: [inaudible].
>> Ramani Duraiswami: Okay. So this image is -- this image is the audio
camera image. This is for the male lone -- sorry, for the female alone, for the
male alone, and this is for both of them.
>>: So those other lighter --
>> Ramani Duraiswami: Reflections.
>>: [inaudible] reflections.
>> Ramani Duraiswami: They are reflections. And here, this is the -- now we
run this MFCC similarity guy for each -- for every direction, and it tells us
this location, this location, this location, and this location and this
location are the
locations of the reflections for the female. And we run -- so this was run first on
the case where we had the female alone and now we ran it also on the case
where we had both the female and male speakers and we are still able to sort of
locate the reflections which correspond to the female.
Likewise we did it for the male alone and for the male in the presence of the
female and again we are able to locate the reflections which correspond to the
male speaker. So now, since we know these, the idea of the algorithm is we
beamform not only to the source but also to the locations of the peaks of the
similarity metric, and now we are going to do delay-and-sum beamforming, but on
the beamformed signals from the spherical image.
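A sketch of the combination step just described: delay-and-sum the beam aimed at the talker with the beams aimed at that talker's reflections, using estimated delays (the delays, gains, and signal layout here are placeholders; the talk's dedicated time-delay estimator is not reproduced):

```python
import numpy as np

def combine_direct_and_reflections(beams, delays_samples, gains=None):
    """Delay-and-sum the beamformed direct-path signal with the beamformed
    reflections attributed to the same talker.

    beams:          list of 1-D time-domain beam outputs (direct path first)
    delays_samples: integer delay of each beam relative to the direct path
    gains:          optional per-beam weights (e.g. to de-emphasize weak reflections)
    """
    if gains is None:
        gains = np.ones(len(beams))
    length = min(len(b) - d for b, d in zip(beams, delays_samples))
    out = np.zeros(length)
    for b, d, g in zip(beams, delays_samples, gains):
        out += g * np.asarray(b)[d:d + length]   # advance each reflection by its delay, then add
    return out / np.sum(gains)
```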
>>: So suppose [inaudible] using you know the regular [inaudible] how much
[inaudible].
>> Ramani Duraiswami: I don't know. But this is a very good question and this
is actually work in progress. This is what we did last month. So I'm just --
during the course of this research, we also developed a very good time delay
estimator -- and now we are sort of checking how good it is -- because it works
very nicely in reverberant environments. For performing delay-and-sum
beamforming, we need to figure out the delay between the reflected image
sources, and so here is the single channel signal. This is after doing the
approximate matched filtering process, and this is the original. And we get
significant improvement, at least in perceptual quality, but we need to run
some more tests to give you quantitative scores or something like that to tell
you how much improvement we've had. Okay?
So now let's sort of look at the evolution of the hardware. So if you want to use
this guy and we want to sort of have many people use this guy, we need to give it
out in a portable format and have it go around. So the first one we built I said is
this spherical array, and you can see the bundle of wires coming out from each
of the microphones sort of here.
And of course the hemispherical array was nice because we had a hole in the
table. We could take this huge parade of wires and route it through. But this
makes the whole thing very unportable. So you want to get it portable. So
about 2007, 2008, we tried to develop a portable version, and this is a 32
channel array which we developed, and we moved all the analog-to-digital
conversion electronics into the center of the ball. And so this guy just has
one power and one USB wire coming out of it. And there is some FPGA stuff
going on there which essentially takes all the analog-to-digital output and
packs it into USB and sends it out.
>>: [inaudible] fit into 2 channels.
>> Ramani Duraiswami: In one wire.
>>: In one wire.
>> Ramani Duraiswami: And now --
>>: [inaudible].
>> Ramani Duraiswami: We -- this one is 12 bits, but we could -- we are
actually using a very small percentage of the available bandwidth. So we could
go to 256 microphones at 24 bits if you wanted. And the version we developed
more recently, at the end of last year, is now 64 channels. So this is a five
and a half inch sphere. And it has one camera which is also built into it. So
this is a single USB camera, and this is the camera looking out. We can look
up with this.
And now we have licensed that design to a company in our area, and these guys
are actually now building an array which has six cameras which stitch panoramas
at the same time as they're taking the panoramic audio images. And it's much
more rugged. These guys built enclosures and so on.
And they are trying to ramp up and trying to sell these things to --
>>: [inaudible].
>> Ramani Duraiswami: The radius is about 5.8 inches, which is about
[inaudible].
>>: [inaudible].
>> Ramani Duraiswami: They can do it smaller, too, but this is [inaudible].
Since I started a few minutes early, let me just finish by saying some other
applications of this spherical array which we are working on. One is we also
want to use it for remote reality reproduction. So we want to essentially
capture reality at some point. Say, take you to a concert hall and then have
the ability to place you in that concert hall. So we capture the spherical
array sound, then we do what's called converting the sound field into a plane
wave representation. So we know the plane waves which are impinging at that
location. And now if I know your head related transfer function, I can place
you at that location by taking those plane waves and convolving them with your
head related transfer function. So that's -- and this is the scene recording,
scene playback.
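A sketch of the playback side just described: each plane-wave signal recovered from the spherical array is convolved with the listener's head-related impulse response for that direction and summed per ear (the HRIR lookup and data layout are assumptions for illustration):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(plane_waves, hrirs_left, hrirs_right):
    """Binaural rendering of a plane-wave decomposition of the captured field.

    plane_waves:             (D, T) time signals, one per plane-wave direction
    hrirs_left, hrirs_right: (D, L) individualized head-related impulse responses
                             for the same D directions
    Returns a (2, T + L - 1) stereo signal for headphone playback.
    """
    left = sum(fftconvolve(s, h) for s, h in zip(plane_waves, hrirs_left))
    right = sum(fftconvolve(s, h) for s, h in zip(plane_waves, hrirs_right))
    return np.stack([left, right])
```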
So as far as head related transfer functions are concerned, these are sort of
things, so I'm just going to sort of skip very quickly to this thing. So it turns out
that human beings, when we hear sound, don't hear the actual source sound; we
hear the sound which is actually received at our ears. So if this is the
envelope of the frequencies of the sound from the source, the sound which is
actually received at your ear is quite different, because some frequencies are
enhanced and some are attenuated by the process of scattering off your own
body. And so you are changing the color of the sound. Okay? So when you are
changing the color -- in light this would be like taking a CD and moving it in
light. You'll get lots of colors because the wavelengths of light correspond
to the bumps on the surface of the CD, so you get this color shimmer. Just
like that, when you move in sound, you are changing the color of the sound
which is coming [inaudible].
And the interesting thing is that this process of changing the color of the
sound is different for every person. So every person changes the color of the
received sound differently, and this gives you additional cues to do source
localization. And if I want to reproduce virtual reality and I want to give
you very stable source locations in an audio presentation over headphones, I
need to know this transfer function for every individual.
And measuring this transfer function -- every individual's shape is different,
so the transfer function is different for every person -- used to be a
relatively tedious affair. What you would do is take a speaker and play a
chirp from one location, then play a chirp from another location, another
location and so on. And you'd do this over two hours and maybe a thousand
directions, and that would give you this head related transfer function.
But we use an idea which is often used in vision, also, which is called Helmholtz
reciprocity. This principle says: suppose I have a sound source at a given
location, let's say here, and I have a receiver here. If I swap the locations
of the source and the receiver, I will get the same measurement at the receiver
location. So this is showing you, in simulation, that even though the overall
acoustic field will be totally different between the two points, there will be
this principle of reciprocity.
So we use this idea to measure the head related transfer function in a few
seconds. So the idea is we place headphone drivers turned outward in people's
ears. We use that microphone array which you saw in the background, and we
record the received signal for all directions in one shot, as opposed to doing
direction-by-direction recording, and we can get this head related transfer
function. And if you compare the head related transfer function measured the
two ways, we get the same thing.
So this is related to the scene reproduction work. We also do computation of
head related transfer functions via computation of scattering. So now we are
developing arrays which are not necessarily sphere shaped but, say, head shaped
or other shapes. So suppose I put microphones on some arbitrary scattering
object like a robot -- a network of microphones. How can I beamform with those
microphones? Then I would have to solve the wave equation to get those
beamforming weights, and we're working on those.
But we're going in sort of too many different directions, so I think I'll just
take this opportunity to conclude and thank you all for inviting me here.
[applause].
>>: [inaudible].
>> Ramani Duraiswami: Is it too loud? It's -- actually what we did is we first
sealed [inaudible], okay?
>>: [inaudible].
>> Ramani Duraiswami: Yeah. And then from the speakers. But since this is a
sort of important question, what we did is we also put in a microphone. We
took a dummy head, put a microphone inside the dummy head, sealed it up
[inaudible], and checked what the dB level of the sound is. It was only 65 dB,
which is much lower than -- it's like conversational speech. Otherwise, you
know, it would be wonderful: you can measure everyone's head related transfer
function, but they are deaf [laughter]. But they are deaf at the end of the
procedure, which is --
>>: [inaudible].
>> Ramani Duraiswami: Yes.
>>: [inaudible].
>> Ramani Duraiswami: Almost entirely geometry. A little bit [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: It's stuck.
>>: Still stuck?
>>: So if you [inaudible] geometry of each.
>> Ramani Duraiswami: You can do -- you can compute it. But the computation is
still relatively expensive. First you, and then you.
>>: Okay. So I'm wondering if the spherical -- I mean it's like [inaudible] what if I
[inaudible] 8 by 8 [inaudible].
>> Ramani Duraiswami: You can do the same kind of processing and create
those same kind of images. The only thing is especially with these real
geometries, if you have all these edge effects, then to get reliable beam
weights, to get the same gain in each direction and so on, it's harder to do
with regular arrays. So for example, in such an array the edge microphones
have different weights, those kinds of things, so there are constraints.
But in principle if you could compute easily the beam weights corresponding to
each direction you can use any procedure to create those images. So the
spherical array buys you the sort of nice mathematical framework to get
[inaudible].
>>: I guess [inaudible] so it's like [inaudible].
>> Ramani Duraiswami: It's very similar to the [inaudible].
>>: The [inaudible] transfer. So it seems easier -- I mean it seems easy to
think about the spacing between microphones and the effect that spacing is
going to have on [inaudible].
>> Ramani Duraiswami: That is correct. So for example, if I have more
microphones on my array, I can go to higher order, and I can get [inaudible].
So for example, when we went to the 64 microphones on the hemisphere, we
effectively got about 120 microphones, so we could get order 8 beams out of
that setup, whereas with the regular 64 microphone array we could only get
[inaudible].
Okay. So this is the demo I was trying to show you which sort of died. This is
the -- Adam, can you sort of [inaudible] around the room. So this is the room in
which the experiment for -- this is a -- so what we did is the experiment had two
speakers and we played database frames.
>>: [inaudible].
>> Ramani Duraiswami: It was all cued up. Okay. So here is the room. You
can sort of -- there's a table in the middle. So it has a very complex structure.
So if you wanted to estimate its impulse response using simple geometry and so
on, it would be very hard. So now we play a sound source. So this is one.
Yeah. So that's one speaker playing a sound source. You can sort of see it's
reflecting off the table, off the wall, off the side wall. There's a second
order reflection on that wall off the whiteboard and so on. There are all
these reflections.
So the goal of that idea is somehow to use these reflections to do [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: The room is about this direction 24 feet and widthwise
it's about 11 feet. So this dimension is 11 feet, and this dimension along the
table is about 24 feet.
>>: This is also continuous sound.
>> Ramani Duraiswami: So this is continuous sound because sound is being pumped
into the room. It's not a chirp. It's continuous speech.
>>: So how come when you see it coming and going [inaudible].
>>: Yeah, it [inaudible].
>>: It's continuous noise so -- if you could see that [inaudible] at this
resolution, you could see the [inaudible]. What we're showing in this
particular image is just a high frequency [inaudible], and so what you see in
the energy is that when the energy happens to drop off in this sort of randomly
generated signal, that's when you start to see the reflections. So of course
you can't see that plot at all.
>>: So what you'll see is you'll have short bursts of 5 or 10 milliseconds where
the energy in the 4,000 hertz band is dropped off and that's sort of these points
where you see the reflections come.
>> Ramani Duraiswami: So of course with speech, when people speak there are
sort of interruptions and so on. But at the same time it doesn't decay like in
the other case, because the person keeps repeating speech.
>>: [inaudible] so it's only, you know, five meters across. Five meters across
the [inaudible], so if it's a continuous sound, why aren't we just seeing it
emit from the direct path and then see the --
>> Ramani Duraiswami: Okay. You do see the direct path, so you can see the
direct path go off and come on, go off and come on.
>>: But this is not really [inaudible] this is [inaudible] no.
>> Ramani Duraiswami: No.
>>: This is very slow so --
>>: This is like [inaudible] that's the type of [inaudible], right, right.
>>: So why don't we see the direct path go off.
>>: Because the -- that speech segment got over.
>>: Oh, okay. It's speech segment.
>>: Okay.
>>: It would be a little more clear if they hadn't chopped the graph at the top
of the energy, but unfortunately I guess trying to use this program at low
resolution it [inaudible].
>>: It's easier you [inaudible] short chirp.
>> Ramani Duraiswami: Right.
>>: And a chirp would disappear.
>> Ramani Duraiswami: But the chirp gives you the impulse response. If you
play that chirp, of course you could use that to identify the impulse response.
You could do that. But the goal here was not to do that; the goal here was to
say, suppose we just take this device and we know nothing about that thing, we
don't have the ability to play chirps, we don't have --
>>: Right, right.
>> Ramani Duraiswami: How can we clean up the speech and record?
>>: So the reverberation time, we didn't get into the reverberation there?
>>: It was about 200, 300 milliseconds.
>>: 300 millisecond to [inaudible].
>>: Well, because the signal was sort of continuous, you will always see
[inaudible].
>>: Well, it looked like -- I mean we didn't see a lot of color from other places, so.
>>: Yeah, most of it was the direct -- if you look at sort of the accumulation over
a short period of time, that's when you get the image from the previous --
>> Ramani Duraiswami: So you can sort of see the other side.
>>: Where you get sort of the spots that correspond to actually reflections
become persistent if you sort of average these images and keep a running
average that you can see where all the people [inaudible] are actually.
>> Ramani Duraiswami: Accumulated.
>>: Accumulated. And that's what we use in --
>>: But things aren't just sort of becoming ambient.
>> Ramani Duraiswami: No.
>>: Reverberation though, right.
>> Ramani Duraiswami: No.
>>: I mean presumably if you [inaudible] quick impulse we would see some level
of ambience, but it would happen much quicker than.
>> Ramani Duraiswami: And at the same time, if you played some continuous
noise, then you might see all these sort of standing waves develop and so on.
That also we didn't do yet.
>>: [inaudible].
>> Ramani Duraiswami: Right.
>>: [inaudible].
>> Ramani Duraiswami: Right.
>>: [inaudible] source come from.
>> Ramani Duraiswami: Right.
>>: [inaudible].
>> Ramani Duraiswami: Right.
>>: So there are a lot of people.
>> Ramani Duraiswami: Yes, so we only tried so far with two sources. And we
are able to distinguish the reflections of the two sources.
>>: I guess like you try to [inaudible].
>> Ramani Duraiswami: Yeah.
>>: Over [inaudible] is the [inaudible].
>> Ramani Duraiswami: And so there are many people who research this. So we
would have to build stronger metrics which are not based on just MFCC, maybe
something else, but anyway. This is -- but this is a well studied field. We
have to borrow their ideas. But this is just the first pass.
>>: So I have a question. Microphone array as a camera.
>> Ramani Duraiswami: Right.
>>: I [inaudible] I wonder didn't actually have [inaudible].
>> Ramani Duraiswami: So we are working on some concepts -- the focal length,
of course, is sort of the radius of the array -- but we are working -- your
question is sort of interesting. Can we, for example, build a telephoto lens
for this camera?
>>: Right.
>> Ramani Duraiswami: And we are working on some ideas like that, trying to
see if we can -- so that's work in progress. If it happens we will -- yeah?
>>: You mentioned that a company is going to build a product of this [inaudible].
>> Ramani Duraiswami: Yes.
>>: [inaudible].
>> Ramani Duraiswami: So the -- they're going to build five next month. The
first five units will be built next month. And then they have sort of two or three
potential customers. If they are able to sell to those, they will continue building.
Otherwise.
>>: So the [inaudible].
>> Ramani Duraiswami: Yes. Yes?
>>: So out of that device are you going to get discrete channels or will there
be [inaudible].
>> Ramani Duraiswami: So that device is just going to send you USB -- two
USBs. So one will have the image stream, which is the camera.
>>: Camera.
>> Ramani Duraiswami: Camera. And the other will have the sound which is
sort of packed channel by channel.
>>: Okay. So in the computer then would be the.
>> Ramani Duraiswami: The processing.
>>: To the image?
>> Ramani Duraiswami: Right.
>>: Okay.
>>: Just to make sure on this then: when you are doing the spherical array,
you're just transmitting the transfer function from that direction to that one.
>> Ramani Duraiswami: Yes.
>>: And then you [inaudible].
>> Ramani Duraiswami: Right. So the -- no. Actually not the transfer
function. What we are doing is, for example, because we have the [inaudible]
channel, the way this would be used with an actual speaker is we would run a
face detector, and we know from the face detector what the locations of the
people are.
>>: Suppose you are [inaudible].
>> Ramani Duraiswami: Yes.
>>: [inaudible].
>> Ramani Duraiswami: So then once I know the primary direction of the source,
I take the beam form sound from that direction.
>>: But the question then how do you get [inaudible].
>> Ramani Duraiswami: Using the spherical array and the knowledge of the
source location.
>>: I know. Why are you using the source location [inaudible].
>> Ramani Duraiswami: Yes.
>>: So what did you --
>> Ramani Duraiswami: So now I run my spherical array beamformer.
>>: Once the coefficient [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: Those W weights with [inaudible].
>>: [inaudible].
>> Ramani Duraiswami: They are based on plane wave scattering off the spherical
array from that particular direction.
>>: [inaudible] the particular microphone is going to have a [inaudible].
>> Ramani Duraiswami: Right.
>>: Then you invert that and then [inaudible].
>> Ramani Duraiswami: Right. Precisely. From that direction. I know the
solution for that direction. And --
>>: [inaudible].
>> Ramani Duraiswami: Well, we don't need to calibrate too much. If you
calibrate -- so for example, there are some corrections we could apply. These
weights correspond to far field waves. If you want, you can correct those
weights for near field sources, and there are procedures to do that, but we
don't do that.
>>: I see. So if you are sort of within -- the corrections would have to
happen, for our size of array, if you're closer than a meter to the array. If
you're more than a meter away you can still use the far field weights.
>>: [inaudible].
>> Ramani Duraiswami: Sorry?
>>: [inaudible].
>> Ramani Duraiswami: You would have to correct both the gain and the phase.
So there are -- this question has been studied a lot in the literature for head
related transfer functions, because this is exactly the same thing which is
measured in that context. And it turns out that there are approximate
geometric expressions you can get for how to correct the phase, how to correct
the gain. But if you are beyond a meter -- and our sphere is about head
sized -- if you are beyond that meter, you don't need to correct those. It's
sort of less than [inaudible] dB, whatever [inaudible].
>> Cha Zhang: Okay. Let's thank the speaker again.
[applause]