>> Ivan Tashev: I think we can start. So good morning. Thanks for coming here. Good morning to those who are just online. Today, we have Ivan Dokmanic from the Audiovisual Communications Laboratory at EPFL, in Switzerland, with his adviser, Professor Martin Vetterli, and today he is going to present the results of his three-month-long internship in Microsoft Research here. Without further ado, Ivan, you have the floor.

>> Ivan Dokmanic: Thanks, Ivan, for the introduction. So, I'm Ivan, and I am going to tell you --

>> Ivan Tashev: There is a difference. I am Ivan, he is Ivan.

>> Ivan Dokmanic: That's right. So I'm going to tell you a bit about what I was doing in the last 12 weeks, actually. So let me start, first things first. Just a small thanks, or a big thanks, to everyone that helped me out with various technical difficulties while I was doing my project. So the audio group, Mark -- I couldn't find a high-resolution photo -- Jans [ph] and Ivan, and also I'd like to thank the hardware team, so Jason Goldstein and Tom Blank. Specifically, Jason, who -- Jason, I was just mentioning you -- who helped me a lot with building and debugging the hardware.

So let's start with a short overview. Basically, I'm going to start with trying to motivate why you'd use ultrasound for something that's typically done with light, and done very well with light, and go through some basic principles of ultrasonic imaging really fast. Then I'm going to talk about our hardware design and beamforming, how to get directivity, some calibration issues, some other things that happened, and so on, and then I'm going to move to algorithms. I'm going to describe the naive algorithm -- so the first thing you try to do, imitate the way you do it with light -- and then why this doesn't work really well, and the things that we tried out and some of the things that turned out to work much better, some things that we propose, specifically for increasing the resolution and for increasing the frame rate, so for obtaining a usable frame rate. And then I'm going to go over some experimental results and suggest things that could be done in the future to get even better results.

Okay, so let's start with why ultrasound. Depth imaging is typically done with light, either by using the disparity between two images, two photos of the same scene, or by using lasers, for example, in Kinect and in lidar. So this is good, and this works fine, but, for example, if you want to use cameras, you actually need the camera. You need a device that does it, which can be quite large, quite expensive, and can use a lot of power. With lasers, there's potential always to use a lot of power, and these are all very fast signals. Light has a very high frequency, so you have to take special care in the design in order to design the circuits so that they're good for RF frequencies and so on. So in what ways can ultrasound provide something else? Some of the advantages are the following. First, the frequency range. The frequency of the ultrasonic sound that we use is 40 kilohertz, and 40 kilohertz is fairly low in comparison with the frequency of light. So this means that you can design very simple hardware. You don't have to pay special attention to PCB design, to RF design and so on, so the design is very simple. On the other hand, the propagation of the sound is fairly slow in comparison with light. So this means that your wavelength is not going to be too large, which is good, because for imaging, you don't want your wavelength to be large.
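As a rough sanity check on that claim (assuming a speed of sound of about 343 m/s in air at room temperature; the arithmetic is not from the talk itself):

\[
\lambda = \frac{c}{f} \approx \frac{343\ \mathrm{m/s}}{40\,000\ \mathrm{Hz}} \approx 8.6\ \mathrm{mm},
\]

so even at a modest 40 kilohertz the wavelength comes out a bit under a centimeter.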
This is somehow a fortunate coincidence. Another good thing is power consumption. So I snipped here two images from the data sheet of the transducer that we're using, and you can see here that at the frequencies of interest, the impedance of our transducer is something like 500 ohms, and we want to drive it to get maybe 100 dB SPL, maybe 105, so we want to drive it to maybe five volts RMS, which results in 10 milliwatts of power consumption if you use it at full duty cycle, but we don't use it at full duty cycle. We use it only a part of the time, when we emit pulses. So do your math -- we can really go low in power consumption. Sensors are cheap. They're quite available off the shelf. The ones that we have are fairly cheap ones, and in some ways it's complementary to light. So light might fail if you have, for example, a thin piece of fabric in front of you, or if the space is filled with smoke, or if there are a lot of mirrors. It could be complementary in some regards.

So, of course, there are a lot of challenges with ultrasound, perhaps many more than advantages when you look at it for the first time. So the first challenge is that the frequency is still low, so there's no way to get the level of detail ultimately that you would get with light, because the wavelength of this ultrasound that we use is, for example, a bit less than one centimeter. Another issue is directivity. If you have lasers, lasers are quite directive by themselves, right? There's no question of trying to make them more directive, but with sound, sound sources are typically not so directive, and so we need to find a way to sort of imitate what we have with lasers with sound. Another issue is slow propagation, and that's a very annoying one. So if you want to imitate light, so if you want to raster scan the scene using ultrasound, and something is maybe three meters away, then for a pulse of sound to travel there and back, it takes maybe 15 milliseconds, which would amount to maybe 60 points, 60 pixels per second, if you scan a scene, or if you want more pixels, you'll have a really lousy frame rate. This is not good. This is completely unacceptable, so we need to deal with that.

And the most annoying thing, probably, is the specularity of sound reflections. So sound always reflects specularly. Very little -- a very small part of this reflection is a diffuse reflection, which means essentially that if I'm standing here, and I'm shooting directly at the wall, I'm going to hear a strong reflection. So if I have a source of sound here and a receiver there, then a strong reflection is going to happen somewhere there. But if I'm trying to see what's happening there with the source and receiver standing here, I'm out of luck, essentially, because the sound won't reflect back here. There is a small diffuse component, but it's completely impossible to see it with what we currently have, so we also need to somehow deal with this. Okay, another challenge is, of course, the attenuation of ultrasound in air. Sound always gets attenuated when it travels through air. For point sources, we have the 1/R law, just because if you put some energy into a point source, then it travels and this energy is spread over larger and larger spheres. So if you're listening at a certain point on this sphere, you hear a sound of somewhat lower intensity. But ultrasound gets attenuated in air because of the compression and expansion of air much more so than ordinary sound.
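To make that level budget a bit more concrete, here is a minimal sketch (not from the talk) that combines spherical spreading with air absorption for a pulse-echo measurement. The absorption coefficient is a placeholder you would look up for the actual frequency, temperature and humidity, and a large flat wall is treated as an image source at twice the target distance.

```python
import math

def two_way_loss_db(distance_m, alpha_db_per_m, ref_m=0.1):
    """Rough pulse-echo level budget against a large specular reflector.

    The reflector is treated as an image source, so the echo behaves like a
    wave that has spread over the full round-trip path (2 * distance) and has
    been absorbed by the air along that whole path. Reflection losses at the
    target itself are ignored.
    """
    round_trip = 2.0 * distance_m
    spreading_db = 20.0 * math.log10(round_trip / ref_m)  # 1/R spreading re: ref_m
    absorption_db = alpha_db_per_m * round_trip           # atmospheric absorption
    return spreading_db + absorption_db

# Illustration only: target 3 m away, assumed absorption of 1 dB/m around 40 kHz.
print(round(two_way_loss_db(3.0, alpha_db_per_m=1.0), 1), "dB below the level at 10 cm")
```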
In fact, above frequencies of around 100 kilohertz, you get attenuation of more than three decibels per meter, which is quite substantial, so for something two or three meters away, you will be attenuated by something like 20 decibels, which is quite a lot. So, with all these challenges, a completely legitimate question is, why bother at all? Why try to do it? And we have a good example in nature of a beast that can actually do quite well with ultrasound, so we hope that the limit is far away. We still don't do as well as they do. These are quite fascinating little creatures, and they use frequencies that are mostly ultrasonic, all the way up to 200 kilohertz, and they have fascinating resolving ability. For example, the big brown bat can detect spheres roughly two centimeters across from five meters away, and perhaps the more fascinating thing is that they can resolve detail that is one-fifth of the wavelength that they are using, so beyond the diffraction limit, using some heavy spatial processing in their brains. We're not quite there yet, but we can hope for it in the future.

So let me go quickly over some related uses of ultrasound. The vast majority of ultrasonic applications are in biomedical imaging, and for a good reason, right? Not so much ultrasound was actually used in air, and a typical image that you see in biomedical applications is an image like this one. This is called a B-mode scan, so what is a B-mode scan? How do you obtain a B-mode scan? The simplest possible way of obtaining an ultrasonic image is by just emitting sound into the body, and then recording reflections and plotting them as a function of time. If you do this in two dimensions and encode the amplitude of the reflection as the intensity of the pixel, you get an image like this. So in the body, you can do that, because always some part of the sound will get retransmitted. So something will reflect here, but some sound will go on. Then it will reflect further. The medium is like a random scattering medium, so every point of your tissue acts like a small scatterer and reflects some sound back.

Okay, so in air -- perhaps you know, in the late '70s, that's one of the first uses, Polaroid was using ultrasonic sensors as range sensors in their cameras, and then there were people like Lindstrom in the early '80s who tried to image spine deformation. They were trying to use ultrasound in air to image spine deformation, so they were mechanically scanning an ultrasonic device along the spine, and they were obtaining images like this one. Then, in the early '90s, people were trying to use ultrasound to measure the skin contour, so that's slightly higher frequencies. And these sensors were always very close to you, so this is not our scenario, where we want to image something that's further apart, further away. Okay, so closer to what we're doing, we have some more recent developments, by Moebus and Zoubir, for example, from 2007, where they used 400 microphones in a synthetic aperture to try to image objects that are a meter or two away from the imaging device. And we also have some attempts to build aids for visually impaired people. For example, this guy uses 64 electret microphones to image the scene, and these guys here also use ultrasound in air, but they use different -- I'll come back to this later a bit.
They use a different kind of transducer than us, something very small, very efficient and something that's probably the future, so that's why I'm mentioning it here. And what I'm showing here are two images from the paper by Moebus and Zoubir. So they were trying to image -- here, they were imaging a PVC pole. I'm not sure exactly at what distance, but I think it was five centimeters across, with 400 microphones, so this is the image that they obtained. And here, they're trying to image a cubical object, also at the same distance. So you understand that specularity is a big problem. You see now what I was talking about when I was talking about the challenges of ultrasound. There is no way to get a reflection from this part of the pole, here, because this doesn't reflect back to the device. So keep these images in mind for later.

So let's talk a bit about hardware and beamforming and everything we have to do with this. So the first thing that I had to do was to choose the proper transducers and proper sensors. And this was quite a challenge, because you sort of need to understand what will play correctly with the hardware that we have here, the MOTU units, how to interface these things to a PC and so on. And so, a very common design of the transducers is piezoelectric, so just a piece of crystal that vibrates when you apply an AC voltage across it, and these are some typical designs. So this is a closed one, if you have to expose it to the elements. This is the one that we actually use. The device is here. The whole thing is here. I can send it around later. And then, you have these things all the way -- so these are really small ones. These are really big ones. They use something like hundreds of watts, and I found this black one, this guy here, on eBay, and the seller lists a couple of intended functions. He says, one, medical treatment. Two, beauty device. Three, heal allergy. So there we go. Yes. Here we have another example. This thing here is 150 watts, okay, and it goes up to 35 kilohertz. I'm just showing it so you understand that there is a huge diversity of these things, and the guy whose webpage I took this from says the sweeping ultrasound can cause certain adverse effects like paranoia, severe headaches, disorientation, nausea, cranial pain, upset stomach or just plain irritating discomfort. Great for animal control. A lot of interesting things out there.

So this is our device, and once again, thanks, Jason, for helping immensely with this thing. Okay, we're sort of debugging this endlessly, and it still has a lot of problems, but we did our best. So let me talk a bit about microphones. So I explained what kind of transducers, what kind of sources we can have. These piezoelectric devices can also act in the opposite direction. When the crystal vibrates, they generate an AC voltage, so you can use them as microphones, but we decided to do a different thing. So we're using here MEMS microphones, typically intended for use in cellphones, and these microphones are typically intended to go up to maybe 10 kilohertz, but it turns out that they also hear 40 kilohertz. Their polar diagram should be omnidirectional, which, at 40 kilohertz and with our design, it apparently is not. I'll talk about it later. This is kind of a challenging point, but the nice thing about this particular model by Knowles is that it has a preamplified differential output, so even without the preamplifiers I'll show on the next slide, I could actually use this directly to drive line inputs.
This is very convenient, and it has a differential output. These are the transducers that I was talking about. We interface everything to a PC through the MOTU units, using DB25 snakes and so on. This is the battery to power the microphones, and this thing uses a separate power supply, because we need to drive it to maybe 10 volts RMS, so we can't use a battery for that. Well, we could. Okay, this is the overall design of our system. We have here the microphone array and the speaker array. Microphones go to microphone preamps and then to the MOTU unit, and everything is driven by MATLAB from a PC, using FireWire -- actually, it's USB. A cool thing, as I was saying, is that you could actually connect these guys directly to the MOTU. This is something that we learned later, but this is just in case we wanted to preamplify them a bit.

Okay, so let me talk a bit about beamforming. For those of you who perhaps don't know what beamforming is -- which is probably none of you -- we have to find a way to direct sound in certain directions, and also to listen directionally. So it turns out that if you have more than one microphone or more than one loudspeaker -- even if each microphone is, for example, omnidirectional -- with more than one you can make the overall characteristic, the overall directivity of the system, selective, by properly playing with the signals that you record from these microphones. So we can use this to create a beam of sound, or to listen in a beam, and try to steer this beam electronically, which is another cool thing, because you can do it electronically, by changing the way you play with these signals, and scan the scene. So there are many ways to do beamforming, but somehow two opposing extremes are something we call the minimum variance distortionless response (MVDR) beamformer -- it's not important what it really is -- and delay and sum, and they're different in the following way. MVDR is great if your system is perfectly calibrated, and then it gives you the best possible beam. But if your calibration is imperfect -- here, calibration is imperfect -- then it could actually perform worse than a very simple beamformer, which is delay and sum, and which turns out to be the most robust to manufacturing tolerances. So we sort of lean towards the delay and sum beamformer because, unfortunately, we don't have excellent calibration of our device, even though we wanted to have it. We tried.

So this is a polar diagram of a three-element microphone array, just to understand what these beams look like. So if you're, for example, listening -- it's a microphone array, right? So if we're listening toward the direction of zero degrees, we get a sound intensity of one, and if we're listening, for example, at 30 degrees, then we get less than a third, but still quite substantial, right? There's still a lot of sound coming from this direction. Okay, something cool that we can do using this beamforming theory is that we can think about how to properly design the geometry of our microphone and loudspeaker arrays, because obviously not all geometries are equally good. Some geometries provide us with better beams and some with worse beams. We need to find a way to measure this, and a classical, good, reasonable way to measure it is using something called the directivity index, which, if B here is the directivity pattern, just says: what is the ratio of the directivity toward the direction that I want to be looking at with respect to the average directivity in all directions?
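Written out, one standard way to define it (with B the directivity pattern, theta_0 the look direction, and the average taken over the full sphere) is

\[
\mathrm{DI}(\theta_0) \;=\; 10\log_{10}\frac{|B(\theta_0)|^{2}}{\tfrac{1}{4\pi}\int_{4\pi}|B(\theta)|^{2}\,d\Omega}\quad[\mathrm{dB}].
\]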
And this is a completely reasonable measure, and so what I was doing is, I took a couple of candidate geometries -- a cross geometry, a square geometry and something like a circular geometry -- and I was varying, for each of them, the spacing, the pitch between microphones, and I was evaluating the directivity index for the corresponding geometry, and I was trying to find the best one. It turns out that for a microphone array, somehow, the best overall geometry was this square one, and the spacing between the microphones that we used is 6.5 millimeters, which is slightly more than lambda over two, half of the wavelength. Okay, so this is the finalized geometry. For the loudspeakers, for the buzzers, for the sources, unfortunately, we couldn't optimize it properly, and the reason is that the physical size of the thing is just too large, so we have a mechanical reason for not being able to put them as close together as we would like. So their pitch is 11 millimeters, which is the smallest mechanically allowed pitch. And what you can see here is that we angle the transducers a bit outward. We tilt them out. Again, this 20 degrees here is not an accident. It's an optimized number, somehow the best one, and the reason is that the transducers themselves are a bit directional. So in order to cover a wider field of view, it's a good idea to just tilt them a bit outward. It turns out that the beams in extreme directions will be better if you do so.

So this is just a simulated beam from our microphone array. This is just to understand that even if you want to have a point here, a pixel, we don't have it, unfortunately, so this is what you would get if you were looking at the beam from the front and just flattened it. So if you have one scatterer, a point scatterer, then instead of seeing a point scatterer, you see something like this, so this is the meaning of this image. So if you remember how it looks, it's three microphones laterally, essentially, and three microphones vertically. So with a three-microphone array in some direction, you can't do much. We have eight. But if you want to extrapolate the performance you have in a horizontal plane, you need to square the number of microphones. So, for example, the performance that you get with eight microphones horizontally you would get with 64 microphones in the whole space.

Okay, so let's go on. Calibration. Calibration is important. Of course, it's important. We want to have perfectly calibrated things. For example, here, we have the frequency response of our transducers -- actually, of some other transducers, but all of them are roughly the same. So you see that at 40 kilohertz, they're the most efficient, and as you move away from 40 kilohertz, they are less and less efficient. So we want to equalize this, because we assume in beamforming that they have a flat response. So we need to push the frequencies away from 40 kilohertz. We need to push them up. So things to calibrate are phase, amplitude and polar response, so we should also measure the directivity patterns of these small devices if we want to do a proper, complete job. So the way I did this is in the anechoic chamber, here, and I have a laser here, so I stuck a laser here, and I was pointing this laser at the microphone to ensure that I'm really looking at the main response axis of these arrays. I was doing it many times. Unfortunately, I was only able to properly calibrate the loudspeaker array.
There was an issue of repeatability, and I couldn't figure out why, but especially with the microphone arrays, their directivity pattern is not at all omnidirectional, so small tilts of these things resulted in completely different gains. So we had to use some sort of overall good gain. Now, enough of that, so let me just show you the actual correction filters, what they look like for the loudspeaker array, just to show you that they make sense, that they look like something that we expected. So on the left-hand side, you see the correction to the amplitudes that we have to make, so you see at 40 kilohertz, they're very efficient, so we don't need to do anything. Away from 40 kilohertz, we have to amplify them, and you see that the phases are fairly reasonable. Here, you also see another thing. You see that two of the transducers are very quiet, for whatever reason, and we really didn't have time to debug this piece of hardware anymore, so we just said, okay, we're going to amplify them digitally. So you see these two transducers we have to push much harder than the other ones, in order for them to perform the same. So to highlight the importance of proper calibration, I'll just show you an example of a loudspeaker beam with and without calibration. Just this morning, I realized that what I'm showing here is actually the microphone beam with the loudspeaker correction filters, but the message is the same. Don't worry. So on the left, you have the calibrated beam, which I had already shown you, and on the right, you have an uncalibrated beam. Perhaps this image shows it better. So the uncalibrated beam looks weird, and even if it's more beautiful than the one on the left-hand side, we actually want to have the one on the left-hand side. And you can see that these sidelobes most likely come from the transducers that we have to push harder -- we have to do something with them.

Okay, so now that I'm done with hardware, let's go to algorithms. Okay, so first, I want to explain how, in general, you create ultrasonic images. What do you do? So I was already talking a bit about B-mode imaging, where you emit sound and then plot the reflection intensity across time, along the time axis. In every mode of ultrasonic imaging, you emit some pulse and then you measure some parameters of the returned pulse, and depending on which parameters of the returned pulse you measure, you get different modes of imaging. So B-mode I already discussed. In air, when you use ultrasound, B-mode imaging makes limited sense, because nothing is getting retransmitted. If I get a reflection from that wall, I should not hope to get a reflection from the wall in another room, the way it works in your body. So it makes much more sense to do something that we call an intensity image: just find the biggest returned pulse from each direction and plot its intensity against angle, so this will be somehow an intensity image. And then in depth imaging, we're not interested in the amplitude of the pulse. We're actually interested in the timing of the pulse, in some temporal property of the pulse, so when the pulse comes back. That's how we create depth images. So these are somehow different approaches. I'm going to quickly go over them and some problems, and then I'm going to tell you how we solved them. So the naive approach is just a raster scan, so to do exactly what a lidar does, for example: a raster scan, pixel by pixel, find the time of flight in this direction, find the time of flight in this direction, plot the times of flight and you get your image, right?
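As a concrete illustration of that naive per-pixel recipe, here is a minimal sketch under assumed parameters (the speed of sound, sample rate, array coordinates, and recordings that start at the moment the pulse is emitted are all placeholders), using simple delay-and-sum on the receive side and a matched filter against the emitted pulse:

```python
import numpy as np

C = 343.0       # assumed speed of sound in air [m/s]
FS = 192_000    # assumed sample rate [Hz]

def naive_depth_pixel(mic_signals, mic_xyz, direction, pulse):
    """One pixel of a naive raster-scan depth image.

    mic_signals : (n_mics, n_samples) recordings, with t = 0 at pulse emission
    mic_xyz     : (n_mics, 3) microphone positions [m]
    direction   : unit vector of the look direction
    pulse       : the emitted pulse template

    1) delay-and-sum the microphone signals toward `direction`,
    2) matched-filter the beamformed signal with the pulse,
    3) take the time of the strongest peak as the round-trip time of flight.
    """
    delays = mic_xyz @ np.asarray(direction) / C        # far-field plane-wave delays [s]
    beam = np.zeros(mic_signals.shape[1])
    for sig, d in zip(mic_signals, delays):
        beam += np.roll(sig, -int(round(d * FS)))       # crude integer-sample alignment
    matched = np.correlate(beam, pulse, mode="full")[len(pulse) - 1:]
    t_flight = np.argmax(np.abs(matched)) / FS          # round-trip time of flight [s]
    return C * t_flight / 2.0                           # depth estimate [m]
```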
So the problem with this is that you have seen the beam, and in reality it's worse. So this is not going to work very well. I'm not going to be getting the time of flight from this direction. I'm actually going to be getting something that's coming from a sidelobe, most of the time. I'm going to come back to this later. So we sort of propose some techniques for deconvolution. I'll explain how this is different from deconvolving an intensity image and how we can deal with this. Then I'll propose another method, based on sound source localization, so how we can couple sound source localization with beamforming to get slightly better images, to get actually usable depth, quote-unquote, images. And I'll tell you how to improve the frame rate. So the frame rate is one of the most annoying issues, so I'll explain how to improve the frame rate from the naive approach, where the frame rate is completely unacceptable, to something that's completely acceptable, like 30 frames per second, while exploiting the whole -- every transducer that you actually have.

So, before moving on, I'll just briefly go over the choice of the pulse that we have. This is a topic in its own right. Bats use chirps. We don't use chirps. Bats are much smarter, so they use chirps. We use something that's quite peaky, and it's like a filtered Dirac pulse, so it's just a band-pass pulse between 38 kilohertz and 42 kilohertz that we use. I could have plotted the autocorrelation of this thing so that you understand what our image looks like when we try to estimate the time delay, but the autocorrelation of this thing is actually this same thing, so it was not necessary, right? So you see what the spectrum looks like, so this is just a filtered Dirac, and here's the formula. It's the difference of two things. Okay, so this pulse turned out to be very good for detecting reflections when they come back, but perhaps you'd want to take some more time to actually design a perfect pulse.

Okay, so let me explain a bit about deconvolution and why we have to think of different ways to do deconvolution than is conventionally done in ultrasonic imaging. So typically, deconvolution is done for intensity images, and so now imagine that that wall over there is a source of noise, is a noise generator, and different parts of this wall generate different amounts of noise, so imagine that the wall is made up of very small pixels of noise, the smallest thing you can detect. Well, then I direct my beamformer towards certain parts of that wall, and I'm not able to listen to individual pixels. What I'll be listening to when I direct the beamformer in some direction is a weighted sum of all these pixels. I was trying to represent this here, mildly successfully. So you get the weighted sum of this intensity across the beam shape. So this is actually a convolution, some convolution, either spatially variant or spatially invariant, with the beam shape, and this is a linear equation, so deconvolution here amounts to inverting a linear system. Then you can talk about conditioning. You can talk about whether you really know the beam shape and things like that, but basically it's inverting a linear system. That's deconvolution in intensity images, right? For us, these small beams are supposed to represent these individual pixels, so what is the image creation model that we have in depth imaging?
Well, imagine that each small reflector here, each pixel, emits something, reflects a part of your sound back, and this reflected sound I call X of theta and T, so for different directions theta and times T. So this is a function of time. Then, again, this gets convolved with the beam shape, because you're not listening to just a particular pixel. You're actually listening to a weighted sum of pixels, so there's a convolution with the beam shape. Then, what we do is we cross-correlate this with our pulse template in order to find peaks, right? So I just wrote it as a convolution with the pulse, which I denote U. And then we actually create the pixel by finding the maximum, by finding the largest peak, okay? So this is the time, the time of the largest peak, so this is our depth. Obviously, this is not a linear model, because there's a maximization inside. We cannot write this out as a matrix multiplication of some true depth image with some beam shape. Our measurements are not linear, okay? So there's no hope to just do very simple deconvolution, as is the case with the linear -- with the intensity images. I hope this is partially clear.

>>: Why not the earliest time?

>> Ivan Dokmanic: The earliest time?

>>: You said the maximum?

>> Ivan Dokmanic: Because the signal -- you will not actually get it. So what if you have a continuous surface, so not discrete distances? Then you'll get some sort of a smeared pulse. That's a good idea, and actually I'm using this idea in something else, with source localization, but the thing is that you're not really getting the correct -- if you have a very wide beam, you'll get some weighted sum of these pulses that will overlap in time, so it's the question of how to disentangle these pulses. Did this answer your question?

>>: Because --

>> Ivan Dokmanic: Most of the time, you will not get discrete, separated pulses. They will be overlapping, if the surface is continuous, so I'm somehow trying to give the complete creation model. For what it is, the first part of the creation model in this case is definitely a convolution, so if you look at certain times, the output of the beamformer is actually some combination of the outputs of this perfect beamformer, so of the reflections that you would be getting from the smallest pixels. So by X here, I denote the smallest pixels, and for every time you get some combination of them. The subscript i here represents all the pixels that contribute to the i-th beam. That's just a shorthand notation for that. So this, again, is some kind of a convolution. It is a convolution, right? It's a two-dimensional convolution, and you can write it out as a matrix multiplication, with some [indiscernible]-like matrix A here. So, since this is correct for every time instant, we can sample the left-hand side and the right-hand side, and we obtain a matrix equation like this one here. Now, observe that this matrix equation has some vertical structure determined by the convolution, so vertically, things on the left are some convolution of things on the right with the beam, but what we don't exploit here quite yet is the fact that this matrix X has horizontal structure. And this horizontal structure is the structure of the pulse. So let me explain this. If you emit a pulse and you have just one reflector in your scene, one point reflecting, what you'll be getting back is the same pulse again, at least in shape. Forget filtering for now, but roughly what will come back is the same waveform, attenuated and delayed, right?
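Condensing this into symbols (the notation here is a sketch of mine, not copied from the slides): sampling in time gives a matrix of beamformer outputs Y that is a beam-shape mixture of the per-pixel responses X, and the single-reflector observation says that each row of X is just a scaled, shifted copy of the emitted pulse u, roughly

\[
Y \;\approx\; A\,X, \qquad x_\theta(t) \;\approx\; a_\theta\, u\!\left(t - \tau_\theta\right),
\]

where A encodes the beam-shape mixing, a_theta the reflection strength of the pixel in direction theta, and tau_theta its round-trip delay.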
So this means that really every row of X here has to be an attenuated and delayed copy of the pulse, of the pulse that you emitted. So we can enforce this in our modeling by representing -- sorry, this should be D here. D. By representing X as a multiplication of a selection matrix, of a delay selection matrix, and the pulse dictionary -- so D here is a pulse dictionary. Every row of D is a different shift of our pulse. And then, matrix B just acts so as to select the proper shift for each individual pixel. These ones are just representing nonzeros. B is obviously very sparse, and we know exactly how sparse it should be. Ideally, it should have only one nonzero per row, because each smallest -- each infinitesimal pixel only gives one single reflection. And so the way to obtain the true B, which is then the representation of our depth image, is by trying to minimize -- it's like compressed sensing -- the one norm of the vectorized B matrix, so we're trying to minimize the number of nonzeros while ensuring that what it yields is not too far away from what we measured. And then we can have some non-negativity constraints. So one thing that you could try doing is to eliminate the dictionary from this equation by multiplying everything from the right by the inverse of the dictionary. This will generally perform worse, because the conditioning of this dictionary is not very good, so this will amplify the noise, and you will get a worse equation, but an advantage of this is that it's now separable. Here, B is in the middle of the two matrices. Here, B is sort of alone, and you can solve this column by column, so it has the potential to work faster.

This is just briefly some simulation results. So for this, I only have simulation results. The reason is, we haven't been able to really measure so far the exact shape of the beam, and this requires it -- not perfectly, but at least to some extent. So this is some artificial depth map that I created, low resolution, because this thing is, for now, slow, and this is the mask that I was using. So I convolved this thing with this mask, and I generated temporal signals using some pulse that I generated, and I added a lot of noise -- a lot of noise -- to this, and then I ran the algorithm, and this is what you get. So you can get deconvolution performance like this one even at 10 dB SNR, so with a lot of noise, much more than you would be getting in practice from ultrasonic images. So this is just to show you that this thing really works, works surprisingly well, but unfortunately slowly, and for real problem sizes, it's still too slow for anything real time.

Okay, so the next problem is, of course, these sidelobes, and the thing that I was talking about before. So say you think you're looking there, but your sidelobe is looking towards that wall, and you think you're getting a reflection from there, but if a reflection from there is 1,000 times weaker and your sidelobe is only 100 times weaker in this direction, you'll be hearing this wall, and it's pictured here. In practice, we don't have a one-to-100 ratio between the main lobe and the sidelobes, so this essentially means that you don't know what you're listening to, because of the sidelobes. There may be nothing there, but you'll get a reflection from here. I hope the illustration helps you to understand. So our proposed solution is to couple this with source localization, and to do it in the following way.
So what I'll do is I'll listen in that direction, and I'll beamform in order to get as directional a thing as I can in this direction, and then I'll find peaks in the beamformed signal, and then I'll go back -- I'll show this again. I'll go back to the raw signals recorded by the microphones, and then I'll feed them into the source localization algorithm, and I'll try to find out where these reflections are really coming from. And if where these reflections are really coming from doesn't agree with where I'm looking, I will not put the pixel there. So this is a block diagram of the whole system. I have microphone signals here. I feed them into the beamformer, but also into this generalization of MUSIC that I implemented, which is just a generalization for both azimuth and elevation. MUSIC is a very well-known source localization algorithm. So, for example, I concentrate on some frame. This is one frame. This here, highlighted in red, is one frame, so it's one direction, and I feed this into the beamformer. This is the output of the beamformer, and in this output, I find peaks. Here, I find one peak, but actually, I find several peaks every time. I find, for example, the largest three peaks. Now what I do is, I zoom in to this peak and go back to the raw signals, and I feed these raw signals into the source localizer, and then I try to see if the direction this reflection is coming from agrees with where I think I'm looking. And I do this for many peaks inside every beamformed signal. This is an output of the source localizer. This is just the score for each direction, for each possible direction, and you can see that, for example, in this particular frame, there were two sources active. For example, I could have been standing like this and imaging myself, and I got a reflection from this hand and this hand together. This makes sense, because they arrived at the same time, but they're obviously arriving from different directions. Okay. What I really do in practice is, for every direction, I make a list of distances, so I'm not just saying this is the distance in this direction. I actually make a list of candidate distances for every direction, and in the end, I pick the smallest one, because the smallest distance somehow makes physical sense, and it turns out to give good results. These are, just for fun, some possible outputs, some possible score maps of the sound-source localization algorithm. Remember, we don't have so many microphones. We have eight microphones for the full azimuth and full elevation, so that's why this is not so peaky. But you can see that sometimes we have a clear source active. Sometimes, we have multiple sources in the same frame, which completely makes sense. So, okay, I'll show you results for this later.

So now I'll move to the next problem, and the next problem is the frame rate. The frame rate suffers because if you want to do raster scanning, so you want to point the beam at every possible location, then it's just too slow. Sound travels too slowly and it's going to take too much time. So what people typically do is they say, "Okay, we can do microphone beamforming in the computer. We can do it offline. So we'll just use one source, just splash everything with sound and record it with microphones -- you need a lot of microphones -- and then do microphone beamforming offline."
But the limit on the frame rate is then quite high, because, well, you just need to send one chirp and then only do the receive beamforming. So now the question is whether there is any benefit in using multiple sources of sound, also. I give here an example by Moebus and Zoubir, who do a great job of imaging with 400 microphones, but they use this argument. They say, okay, we're not going to do transmit beamforming, because that's too slow, so we just use one source of sound. So what we propose to solve this problem is something really simple. It's just basic properties of LTI systems. I'm going to tell you a bit about why people probably don't do it too much.

So let's go over, again, what the microphones hear. So this is the signal at the microphones here. So let's say that S(i) here is the signal emitted by the i-th source, and H(ij) is the channel from the i-th source to the j-th microphone. Then, for all the sources, the j-th microphone just hears the sum of these convolutions. That's very simple. So, for us, every signal emitted by a certain source is a filtered version of some template pulse that we're emitting, so we can now write this S(i) as some W(i) -- which is the beamforming filter for this particular beam and this particular source -- applied to U, the template pulse that we're using. Okay, so we can rewrite this in the frequency domain just as a multiplication. So now look at this. It has two distinct parts. The first part is something that we know for sure. We know the beamforming filters, because we designed them. So this is something we have stored in our device, and there is a part that we have to measure -- we have to put the device physically into the space and then measure -- so it's this part here. It's how the i-th source excites the j-th microphone. But now, if we knew this part here, which I denote with R superscripted i down there, if we knew this red part, we could actually create this sum offline. Do you agree? There is no reason why we wouldn't be able to create this sum offline. So what we can actually do is learn the red parts, easily, by firing every transducer individually and recording the Rs, and then just do the transmit beamforming offline. You agree with me? It's simple linearity and time-invariance. There's nothing special to it, but it allows us to only do eight chirps -- eight chirps if you have eight sources -- and recreate every beam computationally. And instead of having a frame rate of one-twelfth of a frame per second, we can actually have 30 frames per second for a certain resolution.

So why people probably don't use it too much -- of course, people have noticed it -- there are two reasons. First, for lasers, there's no thinking about it. Lasers are themselves very directional, so you use raster scanning. In loudspeaker beamforming, usually, people want to listen to sound, so they reproduce sound for people to listen to, and you can't use this. You can't have a loudspeaker array of 20 loudspeakers and then play each one of them individually and then say to someone, okay, now do the beamforming mentally, in your head. This is not good. You want to create a sound picture, but here we don't care. We're creating an image. It doesn't have anything to do with listening to actual sound, so we can do this. This is just the block diagram. So what we do is we play a delayed version of the pulse through every loudspeaker, filtered by the corresponding correction filters, store whatever is recorded by the microphones -- in our case, this is 64 signals -- and then we create everything from these 64 signals.
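A minimal sketch of that single-shot idea (my code, with assumed placeholders: a fixed sample rate, delay-only beamforming weights, and a `responses[i][j]` array holding what microphone j recorded when source i alone fired the template pulse):

```python
import numpy as np

FS = 192_000  # assumed sample rate [Hz]

def synthesize_transmit_beam(responses, source_delays):
    """Recreate one transmit beam offline from single-source recordings.

    responses[i][j] : samples recorded at microphone j when only source i
                      fired the template pulse (the measured "red part" R^(i))
    source_delays[i]: the delay [s] the transmit beamformer would have applied
                      to source i for this look direction (a general
                      beamforming filter would be convolved in instead)

    Because the system is linear and time-invariant, delaying each recorded
    response by its source's beamforming delay and summing over sources gives
    the same microphone signals as if all sources had fired that beam at once.
    """
    n_src, n_mic = len(responses), len(responses[0])
    out = np.zeros((n_mic, len(responses[0][0])))
    for i in range(n_src):
        shift = int(round(source_delays[i] * FS))
        for j in range(n_mic):
            out[j] += np.roll(np.asarray(responses[i][j]), shift)  # apply source-i delay
    return out  # (n_mic, n_samples): as if the beam had been fired physically
```

Every desired look direction is then just another set of delays applied to the same eight recordings, which is where the frame-rate gain comes from.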
This actually came out of a discussion with an aide who suggested this for deconvolution. This can actually also be used for deconvolution. I'll talk about it in future work. So these are some results -- brace yourselves -- and then some future work that I think could be done in order to really, really improve what I'm getting. So this is the experimental setup at some time of the night in the atrium of our building here. The cleaning ladies were not very happy that I was preventing them from coming here. I was doing the experiment in the atrium because, with the current design, the beam is still too wide and reverberation doesn't do me well when I'm in a room. So when I have a lot of reflecting objects in the room, the images are worse. That's why I went to the atrium, in order to have a smaller number of reflections. I mean, the reflections come much later.

Okay, so, results. I was imaging myself, and -- I didn't mention something. Somehow, the underlying goal of all this is skeletal tracking. What we're trying to do with this is to see whether we can design -- it's an exploratory project -- whether we can design a skeletal tracker using ultrasound instead of light. So I was standing like this in the atrium with my arms folded next to my body, and this is the image that you get. So, first thing, remember what some people get with 400 microphones, and they do a really good job. But you understand that this is actually not bad. It's not bad, because you can actually see that there is a body here -- beyond all questions of how much my body reflects ultrasound, beyond the fact that not every reflection here is specular -- you can actually see something. So now, this that I'm showing you is an intensity image, which means that in every direction, you just calculate the intensity. What we want to have are depth images. So a naive way to create a depth image is to just, as I said, point the beam there and measure the time delay. I said this is not going to work because of the sidelobes, and indeed, this is the depth image that you get with this naive approach. I'm showing this to you just for the drama. So you see everything is at two meters, almost everything, because -- I'm standing there, right -- every time, some part of the beam, some part of the sidelobes, will catch my body and will create a reflection, so I'll think that everything is at two meters. After applying the sound source localizer to this, we get a much more reasonable thing. So you see that now, only the pixels that I can trust are shown, and these pixels are at correct distances, and they're sort of aligned with my body. Of course, this can be improved further with further work, but this is a pretty large improvement from this.

So now let's see a more interesting thing, and this is kind of cool. So what happens if I spread my arms? Arms are quite small, whatever, and they're not very good reflectors, and there's specularity and whatnot. So if I spread my arms, you can actually see it in the intensity image, which is kind of -- I have to say, maybe four weeks ago, I lost hope for this, but now I'm really happy that I got this image. This means that you can see where the arms are, so imagine having many more microphones, imagine having better algorithms. You could probably do very well. So you can actually see where the arms are. Again, just for comparison, this is with arms folded and this is with arms spread, so you see arms here and here.
That image, again, with the naive approach: everything is at around two meters. Not very good. If you apply sound source localization now -- and it's good -- we get something like this, so you get this alien-like creature which, with some imagination, is me in that posture, but this actually shows something, some body and some arms spread, so it's much better than this. And it's actually pretty good, because the distances are correct and the arms are in the correct positions. The reason why this works is that on the body you have many kinks, parts, small details that actually do reflect sound in different directions, so you can hope to find something around here that will reflect sound back to your device. And what I'm showing here is the same thing, but with this fast-frame, single-shot acquisition, with linearity, again with my arms folded. This was a different experiment, so this is not the correct photo. I should have put up a different photo from that experiment, but there were no people in the atrium who could take a photo of me at that time, so I had to give up. But you see that the image is somewhat worse. I would expect the image to be exactly the same as the other one. Unfortunately, for some reasons that I haven't been able to debug yet, the image is slightly worse, but still it's sort of correct, right? And again, everything is at -- I was standing a bit closer, at a bit less than two meters -- in the naive approach, where I sort of recreate the beams. If I apply sound source localization, then you sort of see the head, which is a strong reflector, and you see arms. This requires further work. It's not quite clear to me why this didn't work as expected, but in theory it should, so this is probably just a bug.

So these are the results, and I'll go over some of the conclusions. So what did I have to do? I had to design this hardware, with a lot of help from Jason. This involved a lot of things, like how to interface it to a PC, and definitely how to design the geometry. I explored various image creation algorithms, and the images that we have, they have three-degree angular resolution. You could create finer images, but with the current beam shape, it doesn't make much sense, right? They wouldn't have much physical meaning, unless you were able to properly deconvolve them. So three-degree angular resolution is somehow qualitatively a good number of pixels. I proposed some convex optimization-based and some source localization-based algorithms to improve the depth images, to improve the actual depth estimation. And then, one of the major problems, the speed of sound, and the proposed solution. The experiment still needs some work, but definitely, the solution is to use multiple transducers, and to use these simple properties of LTI systems so that I can do things offline, even on the transmit side. And so this takes us from one-twelfth of a frame per second for the naive approach to 30 frames per second for eight transducers with this approach, with the transmit beamforming done offline. And the good thing about this, the great thing about this, is that this doesn't scale, right? Even if you fire only eight transducers and use more sophisticated algorithms, you'll still need exactly one-thirtieth of a second to acquire the frame, because you only need to fire eight transducers, so it doesn't scale. See what I'm trying to say? So you can have very sophisticated algorithms after this, but the image acquisition is going to be very fast, so the sky is the limit.
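Roughly, and ignoring any overlap between listening windows or processing time (a back-of-the-envelope model, not a formula from the talk), the acquisition time per frame for this single-shot scheme is

\[
T_{\mathrm{frame}} \;\approx\; N_{\mathrm{sources}}\left(\frac{2R_{\max}}{c} + T_{\mathrm{pulse}}\right),
\]

compared with the same expression with N_sources replaced by the number of pixels for a physical raster scan -- which is why the acquisition time stops depending on how many beams you later synthesize in software.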
The major advantage of this thing is low power consumption. Again, things operating with lasers use a lot of power. It does depend on the number of pixels, but typically they use a lot of power. In this regard, our prototype uses three watts. This is a lot, and this is because of the nature of the operational amplifiers that we're using. The currents are quite high, but if you did the design specifically for these transducers, you could achieve something like one watt with the transducers firing at once during the pulse, and with perhaps a 10 percent duty cycle -- you're not using them all the time -- this results in maybe 100 milliwatts of power consumption. Then, if you go back here and you do a single-shot acquisition, where only one transducer is active at a time, you could actually perhaps divide this by five, so get something like 20 milliwatts, which is a huge difference if you compare it with what you need for lasers. Microphones virtually don't need any power.

Okay, and so the question, which I didn't highlight enough at the beginning of the talk, is: ultrasound for skeletal tracking? Well, I think that probably the answer is yes. If you use higher frequencies and you play with the algorithms, and you have proper calibration, I'm pretty sure that you can extract enough meaningful features about the positions of limbs in order to create a skeletal tracker, and so let me go just quickly through some possible future directions. So, definitely, there is value in B-mode or intensity images. So perhaps we could use these images to maybe detect connected regions or something like that, and then use them to enhance the depth images. So there would be value in combining the two kinds of information to build a reasonable skeletal tracker. And then, for the actual skeletal tracking, well, what is the information that we use? The information that we use is the 64 signals that we recorded. So perhaps the important features are the times and intensities in every temporal bin of these 64 signals. That's what matters. So what you could do is divide these 64 signals into bins, compute the average intensity over every bin and feed these as features for the training of the skeletal tracker, right? You see what I'm trying to say. So you would not create the depth image at all. You could completely bypass the depth image and just get the skeletal tracking from these features, which directly encode, actually, the positions of objects in the scene. Okay, we can formulate the deconvolution that I was proposing directly in the signal domain, never creating beams. So we can also formulate this deconvolution starting from the eight-by-eight signals that we have, and similarly for the other thing. And of course, the most mundane idea that we have is to increase the number of microphones. So, nowadays, these guys, Przybyla -- one of the papers that I showed on one of the first slides -- use something called capacitive micromachined transducers, and these things you can fit -- for example, they fit 37 microphones on a PCB 6.5 by 6.5 millimeters. It's quite fascinating. The problem currently is that they work at slightly higher frequencies, so their range is limited, but I think in the very foreseeable future, this is going to be the technology of choice for ultrasonic imaging in air. So just as an example, what could you do if you had 64 microphones? So this would be the beam shape, right, and this is the beam shape that we have currently. So, obviously, virtually no sidelobes. It's still not a point.
It's still some piece of a beam, but it's much better than what we currently have. So many of the problems would be mitigated by having more microphones. Okay, well, that's roughly everything that I had to say. Thanks for listening. If you have any questions, please go ahead.

>>: I would like to ask some questions. You're trying to take one picture using the ultrasound and turn it into skeletons, right?

>> Ivan Dokmanic: For example, yes, but you'd do it consecutively.

>>: And how many frames did you use to create those?

>> Ivan Dokmanic: These are all single frames.

>>: But do you think it will get better if you use multiple frames?

>> Ivan Dokmanic: I would say yes. No is obviously the wrong answer -- it can't get worse, so it will get better -- because you will have some modeling over time, so you could probably use this also to get higher resolution.

>>: And it just occurred to me. You're using a single pulse, right? A single-frequency pulse, like --

>> Ivan Dokmanic: We use a band, like from 38 kilohertz to 42.

>>: Yes, but do you think it's possible to shift that -- like make the band a little smaller and shift it across multiple frames, so for instance, on the first frame, you shoot from 38 kilohertz to 39 kilohertz, on the second frame you shoot 39 to 40, something like that.

>> Ivan Dokmanic: OFDM kind of --

>>: You get a similar chirp-like signal, but it's not exactly a chirp, but it's like a beep.

>> Ivan Dokmanic: Yes, definitely. We were thinking about things like that. One thing that you could do for sure is use multiple frequencies, not like 38, 39, but try to use 40 kilohertz and maybe 80 kilohertz, or 40 and 73 -- I don't know. And then this could be really useful to remove, for example, coherent speckle or some effects that are artifacts of the actual frequency that we're using. So the things that you're suggesting would also be great for improving the frame rate further, in the following sense. If you could use this kind of OFDM idea, if you could use separate bands, you could actually emit one beam in one part of the spectrum, another beam in another part of the spectrum and another beam in a third part of the spectrum, all at the same time. So that's a good idea, but it requires a lot more design, and you need something slightly more wideband. So I had a lot of trouble with this thing, with the repeatability of these transducers. Sometimes, they don't work as expected, so you need a more reliable source of ultrasound that's more wideband, because these things are very narrowband. It's 40 kilohertz and a bit to either side, so we need to push things a lot, right? Maybe you'd like to have something that has a much wider bandwidth, and these things exist.

>>: More questions?

>>: Two questions, actually. First, in terms of what [indiscernible], I'm most interested in a dynamic interacting [indiscernible], so can you exploit -- can you exploit the difference from frame to frame? So you showed that as one frame, and so I don't even care what's here. When I do that lag, it's like, what's different now? These are the parts, and then try to figure out which parts of that are different. That's the part that's using --

>> Ivan Dokmanic: Definitely, yes. I think that's a good idea. Actually, now that you're saying this, this kind of reminds me of -- maybe someone is familiar with the phase vocoder or something like this. You're trying to determine a real frequency. You have an FFT of something, and you're trying to determine the real frequency, right?
So the way to do it is to take two consecutive frames, see the phase change and interpolate the frequency. And then you're not limited to the bin anymore. You can get the real frequency. So, yes, I think this could be really useful, because an intensity image could also be used for that, right? Because some intensities would slightly go down, some intensities would slightly go up, so you could probably interpolate, play on this, to get a much better image.

>>: And the second question is [indiscernible] position of that matrix, of getting the various functions. So one of your hypotheses was there was only one single [indiscernible] from uncertainty [indiscernible], but due to multipath that's actually not true, right?

>> Ivan Dokmanic: Well, if you have a really narrow beam, so that you can really listen to this particular pixel, then what has to happen is that you have a multipath component that comes from the exact same direction, and this is very unlikely to happen -- well, because it's probability zero. In the limiting case, where you have infinitesimal pixels, if you're listening exactly in this direction, then to also have a multipath component coming from exactly this direction is probability zero. A lot of random things would have to align in order to have a couple of reflections and then suddenly a reflection again from there.

>>: But the probability, isn't it the same that something comes back to that?

>> Ivan Dokmanic: It is, but infinitesimal pixels for me are -- okay, perhaps I didn't understand your question, and I see what you're saying. So, for example, if I have a mirror, how will I know that it's a mirror, that it's not something over there? That's a good question, and the answer is, without further information, there's no way to tell. So if you're just using this information, there's no way to tell. You should be able to estimate the positions of such big reflectors. One way to do it -- it would probably be too slow, but since this is all modeled by image sources, actually, one way to do it would be to try to combine pairs of pixels to get a third pixel. How to say this? So this reflection from that wall will also have a corresponding object there, so they have a clear geometrical relationship, so the object that creates the reflection will also hopefully be inside the scene. So you have seen this object and you have seen its reflection, and maybe you know that there is a wall here. There's this geometrical relationship between the reflection and the real object, and you can detect the reflection by exploiting this image source relationship. So you can say, okay, ah, this guy is actually a reflection of this guy. I'll just burn it out. That would work, but it would probably be too slow, because you need to combine pairs of these. But that's one idea of how to remove this thing.

>>: Other questions?

>>: The problem of the sidelobes picking up reflections where those are [indiscernible]. Did you consider treating it as a filter design problem, whereby, at the expense of increasing the size of the main lobe, you could suppress the sidelobes and reduce the probability of a reflection falling on a sidelobe, but at the same time reducing the resolution because of the [indiscernible]?

>> Ivan Dokmanic: Actually, I had it. That was initially one of the ideas, to sort of design the beam pattern, but then, later, we didn't really do it. The reasons are that we hoped the sidelobes were not going to be so loud, and then later I just forgot about it.
Then, the main lobe is still quite wide, and if we did it like that, then it would be even wider, considerably wider, probably. So it would be beneficial -- okay, so the main problem with the sidelobes is with specular reflection. So I'm trying to get something from there, and I'm hoping -- I don't know why, but I'm hoping -- for a small, diffuse reflection from something over there, and here I'm getting a strong reflection. So this diffuse reflection is, for example, 1,000 times weaker than a strong specular reflection from there. And I doubt that, especially with eight microphones, you can create a beamformer that will suppress everything but the main lobe more than 1,000 times. I'm sure that you can't. You see what I'm saying? The problem is that if there is a strong reflector in the scene, even if you're shooting just a bit of sound or listening just slightly in this direction, you'll hear everything coming from this strong reflector. One reflector in the scene has a reflectivity that's 1,000 times stronger than the other things in the scene -- so maybe some fabric and a piece of wood; I'm just talking gibberish -- but then everything would seem to be coming from this piece of wood. I'm not sure if I'm clear.

>>: If you were to practically deploy it, say in a living room, where there are reflections that you expect always to be there, you wouldn't expect to have [indiscernible], could you have a calibration procedure that determined where they're likely to be, to make sure that the nulls of the sidelobes are pointing in the directions where you actually --

>> Ivan Dokmanic: Definitely, that's a great idea. Yes, I didn't think of that, and this seems like a really practical and good idea. Why not? So we just sort of move out of the way, calibrate, and then go and use it. This is actually good, especially with more microphones and more transducers. You could certainly play on that.

>>: In some senses, that's a bit like MVDR, in the sense that if you walk out, then your noise is your unwanted signal, which might be reflected. You first try to minimize that, such that when you actually come back in, then the only thing that's in the room is the wanted signal, and that's going to be you.

>> Ivan Dokmanic: I like this idea. I think this could be really good.

>> Ivan Tashev: More questions?

>>: Can you increase the number of microphones without increasing the number of transducers?

>> Ivan Dokmanic: Of course. That would be something that you could do. So there is a benefit in having more than one transducer, and the benefit is the following. You can sort of partially isolate regions of space, which is good for sound source localization. I can elaborate on that later, but definitely, what you would do is you would not increase the number of transducers too much, but you would greatly increase the number of microphones.

>> Ivan Tashev: More questions? Okay, let's give thanks to our speaker today.