>> Ivan Tashev: I think we can start. So good morning. Thanks for coming here. Good
morning to those who are just online. Today, we have Ivan Dokmanic from Audiovisual
Communications Laboratory at EPFL, in Switzerland, with his adviser, Professor Martin
Vetterli, and today he is going to present the results of his three-month-long internship in
Microsoft Research here. Without further ado, Ivan, you have the floor.
>> Ivan Dokmanic: Thanks, Ivan, for the introduction. So, I'm Ivan, and I am going to tell you --
>> Ivan Tashev: There is a difference. I am Ivan, he is Ivan.
>> Ivan Dokmanic: That's right. So I'm going to tell you a bit about what I was doing in the last
12 weeks, actually. So, first things first: just a small thanks, or a big thanks, to everyone who helped me out with various technical difficulties while I was doing my project. So the audio group -- Mark (I couldn't find a high-resolution photo), Jans [ph] and Ivan -- and also I'd like to thank the hardware team, Jason Goldstein and Tom Blank. Specifically Jason -- Jason, I was just mentioning you -- who helped me a lot with building and debugging the hardware.
So let's start with a short overview. Basically, I'm going to start with trying to motivate why
you'd use ultrasound for something that's typically done with light, and it's done very well with
light, and go really fast through some basic principles of ultrasonic imaging. Then I'm going to talk about our hardware design and beamforming, how to get directivity, some calibration issues, some other things that happened, and so on, and then I'm going to move to algorithms. I'm going to describe the naive algorithm -- the first thing you try: imitate the way you do it with light -- and then why this doesn't work really well, the things that we tried out, some of the things that turned out to work much better, and some things that we propose, specifically for increasing the resolution and for increasing the frame rate, for obtaining a usable frame rate.
And then I'm going to go over some experimental results and suggest things that could be done
in the future to get even better results.
Okay, so let's start with why ultrasound. Depth imaging is typically done by light, either by
using the disparity between two images, two photos of the same scene, or by using lasers, for
example, in Kinect and in Lidar. So this is good, and this works fine, but, for example, if you
want to use cameras, you actually need the camera. You need a device that does it, which can be
quite large, quite expensive, can use a lot of power. With lasers, there's always the potential to use a lot of power, and these are all very fast signals. Light has a very high frequency, so you have to take special care in the design, in order to design the circuits so that they're good for RF frequencies and so on. So in what ways can ultrasound provide something else?
Some of the advantages are the following -- first, the frequency range. The frequency of the
ultrasonic sound that we use is 40 kilohertz, and 40 kilohertz is fairly low, in comparison with
the frequency of light. So this means that you can design very simple hardware. You don't have
to pay special attention to PCB design, to RF design and so on, so the design is very simple. On
the other hand, the propagation of the sound is fairly slow, in comparison with light. So this
means that your wavelength is not going to be too large, which is good, because for imaging, you
don't want your wavelength to be large.
This is somehow a fortunate coincidence. Another good thing is power consumption. So I snipped here two images from the data sheet of the transducer that we're using, and you can see that at the frequencies of interest, the impedance of our transducer is something like 500 ohms, and we want to drive it to get maybe 100 dB SPL, maybe 105, so we want to drive it at maybe five volts RMS, which results in about 10 milliwatts of power consumption if you use it at full duty cycle. But we don't use it at full duty cycle; we use it only part of the time, when we emit pulses. So do the math -- we can really go low in power consumption.
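To write that back-of-the-envelope out explicitly (treating the transducer impedance as roughly resistive at the driving frequency, which is only an approximation for a piezo element), the average electrical power is on the order of

\[
P_{\text{avg}} \;\approx\; \frac{V_{\text{rms}}^{2}}{|Z|} \times (\text{duty cycle}),
\]

so whatever the continuous-drive figure is for a few volts RMS into a few hundred ohms, pulsed operation scales it down further by the duty-cycle factor.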
Sensors are cheap. They're quite available off the shelf. The ones that we have are fairly cheap
ones, and in some ways it's complementary to light. So light might fail if you have, for example,
a thin piece of fabric in front of you or if the space is filled with smoke or if there's a lot of
mirrors. So it could be complementary in some regards. Of course, there are a lot of challenges with ultrasound, perhaps many more challenges than advantages when you look at it for the first time. The first challenge is that the frequency is still low, so there's no way to get the level of detail, ultimately, that you would get with light, because the wavelength of the ultrasound that we use, for example, is a bit less than one centimeter. Another issue is directivity. If you have lasers, lasers are quite directive by themselves, right? There's no question of trying to make them more directive, but with sound, sound sources are typically not so directive, and so we need to find a way to sort
of imitate what we have with lasers with sound. Another issue is slow propagation, and that's a very annoying one. If you want to imitate light -- if you want to raster scan the scene using ultrasound -- and something is maybe three meters away, then for a pulse of sound to travel there and back, it takes maybe 15 milliseconds, which would amount to maybe 60 points, 60 pixels per second, if you scan a scene; or, if you want more pixels, you'll have a really lousy frame rate. This is not good.
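To put rough numbers on that (taking the speed of sound as c, roughly 343 m/s, and an object at d of about 3 m):

\[
t_{\text{round trip}} \;=\; \frac{2d}{c} \;\approx\; \frac{2 \times 3\ \text{m}}{343\ \text{m/s}} \;\approx\; 17\ \text{ms},
\qquad
\frac{1}{t_{\text{round trip}}} \;\approx\; 60\ \text{pixels per second},
\]

which is where the figure of roughly 60 points per second comes from.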
This is completely unacceptable, so we need to deal with that. And the most annoying thing,
probably, so the specularity of sound reflections. So sound always reflects specularly. Very
little -- a very small part of this reflection is a diffuse reflection, which means essentially that if
I'm standing here, and I'm shooting directly to the wall, I'm going to hear a strong reflection. So
if I have a source of sound here and a receiver there, then a strong reflection is going to happen
somewhere there. But if I'm trying to see what's happening there with the source and receiver
standing here, I'm out of luck, essentially, because the sound won't reflect back here. There is a
small diffuse component, but it's completely impossible to see it with what we currently have, so
we also need to somehow deal with this.
Okay, another challenge is, of course, attenuation of ultrasound in air. Sound always gets
attenuated when it travels through air. There is -- for point sources, we have 1/R law, just
because if you put some energy in a point source, then it travels and this energy is spread over
larger and larger spheres. So if you're listening in a certain point on this sphere, you hear a
sound that's a bit lower intensity. But ultrasound gets attenuated in air because of the
compression and expansion of air much more so than the ordinary sound.
In fact, above frequencies of around 100 kilohertz, you get attenuation of more than three decibels per meter, which is quite substantial, so for something two or three meters away -- a round trip of five or six meters -- you will be attenuated by something like 20 decibels, which is quite a lot. So, with all these challenges,
a completely legitimate question is why bother at all? Why try to do it? And we have a good
example in nature of this beast that can actually do quite well with ultrasound. We hope that the
limit is far away. We still don't do as well as they do. These are quite fascinating little
creatures, and they use frequencies that mostly are ultrasonic, so all the way up to 200 kilohertz,
and they have fascinating resolution, resolving ability. For example, the big brown bat can
detect spheres of roughly two centimeters across at five meters away, and perhaps the more
fascinating thing is they can resolve detail that is one-fifth of the wavelength that they are using,
so beyond the diffraction limit, using some heavy spatial processing in their right brain. We're
not quite there yet, but we can hope for it in the future.
So let me go quickly over some related uses of ultrasound, and ultrasound is mostly used -- the
vast majority of ultrasonic applications are in biomedical imaging, and for a good reason, right?
So not so much ultrasound was actually used in air, and a typical image that you see in biomedical applications is an image like this one. This is called a B-mode scan, so what is a B-mode scan? How do you obtain a B-mode scan? The simplest possible way of obtaining an ultrasonic image is by just emitting sound into the body, then recording the reflections and plotting them as a function of time. If you do this in two dimensions and encode the amplitude of the reflection as the intensity of the pixel, you get an image like this.
So in the body, you can do that, because some part of the sound will always get retransmitted. So
something will reflect on here, but some sound will go on. Then it will reflect further. The
medium is like this random scattering medium, so every point of your tissue acts like a small
scatterer and reflects some sound back. Okay, so in air, perhaps you know, in the late '70s, that's
one of the first users. Polaroid was using ultrasonic sensors as range sensors in their cameras,
and then there were people like Lindstrom in the early '80s that tried to image the spine
deformation. They were trying to use ultrasound in air to image spine deformation, so they were
scanning mechanically an ultrasonic device along the spine, and they were obtaining images like
this one. Then, in the early '90s, they were trying to use ultrasound to measure the skin contour, at slightly higher frequencies. And these sensors were always very close to you, so this is not our scenario, where we want to image something that's further away.
Okay, so more close -- closer to what we're doing, we have some more recent developments, by
Moebus and Zoubir, for example, from 2007, where they used 400 microphones in a synthetic
aperture to try to image objects that are a meter or two meters away from the imaging device.
And we also have some attempts to build aids for visually impaired people. For example, this one uses 64 electret microphones to image the scene, and these guys here also use ultrasound in air, but they use different -- I'll come back to this a bit later -- they use a different kind of transducer than ours, something very small, very efficient and something that's probably the future,
so that's why I'm mentioning it here. And what I'm showing here are two images from the paper
by Moebus and Zoubir. So they were trying to image -- here, they were imaging a PVC pole.
I'm not sure exactly what distance, but I think it was five centimeters across, with 400
microphones, so this is the image that they obtained. And here, they're trying to image a cubical
object, also the same distance. So you understand that specularity is a big problem. You see
now what I was talking about when I was talking about challenges of ultrasound.
There is no way to get a reflection from this part of the pole, here, because this doesn't reflect
back to the device. So keep these images in mind for later. So let's talk a bit about hardware and
beamforming and everything we have to do with this. So the first thing that I had to do was I had
to choose the proper transducers and proper sensors. And this was quite a challenge, because
you sort of need to understand what will play correctly with the hardware that we have here,
MOTU units, how to interface these things to a PC and so on. A very common design of the transducers is piezoelectric -- just a piece of crystal that vibrates when you apply an AC voltage across it -- and these are some typical designs. So this is a closed one, if you have to expose it to the elements. This is the one that we actually use. The device is here. The whole thing is here. I can send it around later. And then, you have these things all the way up -- so these are really small ones. These are really big ones. They consume something like hundreds of watts, and I found this black one, this guy here, on eBay, and the seller lists a couple of intended functions. He says, one, medical treatment. Two, beauty device. Three, heal allergy. So there we go. Yes.
Here we have another example. This thing here is 150 watts, okay, and it goes up to 35 kilohertz. I'm just showing it so you understand that there is a huge diversity of these things, and the
guy whose webpage I took this from says, the sweeping ultrasound can cause certain adverse
effects like paranoia, severe headaches, disorientation, nausea, cranial pain, upset stomach or just
plain irritating discomfort. Great for animal control. A lot of interesting things out there.
So this is our device, and once again, thanks, Jason, for helping immensely with this thing.
Okay, we're sort of debugging this infinitely, and still it has a lot of problems, but we did our
best. So let me talk a bit about microphones. I explained what kind of transducers, what kind of sources we can have. These piezoelectric devices can also act in the opposite direction: when the crystal vibrates, they generate an AC voltage, so you can use them as microphones, but we decided to do a different thing. So we're using MEMS microphones here, typically intended for use in cellphones, and these microphones are typically intended to go up to maybe 10 kilohertz, but it turns out that they also hear 40 kilohertz. Their polar diagram should be omnidirectional, which, at 40 kilohertz and with our design, it apparently is not. I'll talk about it later. This is kind of a challenging point, but the nice thing about this particular model by Knowles is that it has a preamplified differential output, so even though I feed it into the preamplifiers I'll show on the next slide, I could actually use it directly to drive line inputs. This is very convenient, and it has a differential output.
These are the transducers that I was talking about. We interface everything to a PC through a MOTU unit, using DB25 snakes and so on. This is the battery to power the microphones, and this thing uses a separate power supply, because we need to drive it at maybe 10 volts RMS, so we can't use a battery for that. Well, we could. Okay, this is the overall design of our system. We have here the microphone array and the speaker array. Microphones go to microphone preamps and then to the MOTU unit, and everything is driven by MATLAB from a PC, using Firewire -- actually, it's USB. A cool thing, as I'm saying, is that you could actually feed these guys directly into the MOTU. This is something that we learned later, but this is just in case we wanted to preamplify them a
bit.
Okay, so let me talk a bit about beamforming. For those of you who perhaps don't know what beamforming is -- which is probably none of you -- we have to find a way to direct sound in certain directions, and also to listen directionally. It turns out that if you have more than one microphone or more than one loudspeaker -- even if each microphone is, for example, omnidirectional -- with more than one you can arrange it so that the overall characteristic, the overall directivity of the system, is selective, by properly playing with the signals that you record from these microphones. So we can use this to create a beam of sound, or to listen in a beam, and try to steer this beam electronically -- which is another cool thing, because you can do it electronically by changing the way you play with these signals -- and to scan the scene. There are many ways to do beamforming, but two opposing extremes are something we call the minimum variance distortionless response (MVDR) beamformer -- it's not important what it really is -- and delay and sum, and they're different in the following way.
MVDR is great if your system is perfectly calibrated, and then it gives you the best possible
beam. But if your calibration is imperfect -- here, calibration is imperfect -- then actually it
could perform worse than a very simple beamformer, which is delay and sum, and which turns
out to be the most robust to manufacturing tolerances. So we sort of lean towards the delay and
sum beamformer because, unfortunately, we don't have excellent calibration of our device, even
if we wanted to have it. We tried.
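To make the delay-and-sum idea concrete, here is a minimal time-domain sketch using Python and NumPy; the plane-wave model, sign convention and integer-sample delays are simplifications, and none of this is the project's actual MATLAB code.

    import numpy as np

    C = 343.0  # speed of sound in air, m/s

    def delay_and_sum(signals, mic_xyz, look_dir, fs):
        """Steer a microphone array toward a far-field look direction.

        signals  : (n_mics, n_samples) recorded waveforms
        mic_xyz  : (n_mics, 3) microphone positions in meters
        look_dir : unit vector pointing from the array toward the source
        fs       : sampling rate in Hz
        """
        n_mics, n_samples = signals.shape
        # Plane-wave model: a microphone whose position projects further onto
        # look_dir hears the wavefront earlier, so it must be delayed more.
        delays = mic_xyz @ look_dir / C
        delays -= delays.min()                      # keep all delays non-negative
        out = np.zeros(n_samples)
        for m in range(n_mics):
            k = int(round(delays[m] * fs))          # integer-sample approximation
            out[k:] += signals[m, : n_samples - k]  # delay channel m by k samples
        return out / n_mics

The same set of delays, applied to the driving signals instead of the recordings, gives the transmit beam; a fractional-delay or frequency-domain implementation would be used in practice.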
So this is a polar diagram of a three-element microphone array, just to understand what these beams look like. So if we're listening, for example, toward the direction of zero degrees, we get a sound intensity of one, and if we're listening, for example, at 30 degrees, then we get less than a third, but still quite substantial, right? There's still a lot of sound coming from this direction. Okay, something cool
that we can do using this beamforming theory is we can think about how to properly design the
geometry of our microphone and loudspeaker arrays, because obviously not all geometries are
equally good. Some geometries provide us with better beams and some with worse beams. We
need to find a way to measure this, and a classical, reasonable way to measure it is using something called the directivity index: if B here is the directivity pattern, the directivity index just says what the ratio is of the directivity toward the direction I want to be looking at to the average directivity over all directions. This is a completely reasonable measure, and so what I was doing is, I took a couple of candidate geometries -- a cross geometry, a square geometry and something like a circular geometry -- and for each of them I was varying the spacing, the pitch between microphones, evaluating the directivity index for the corresponding geometry, and trying to find the best one. It turns out that for the microphone array, somehow, the best overall geometry was the square one, and the spacing between the microphones that we used is 6.5 millimeters, which is slightly more than lambda over two, half of the wavelength.
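As a sketch of the kind of search described here -- assuming NumPy, a narrowband model at 40 kHz, and an illustrative eight-element layout on the perimeter of a 3-by-3 grid (the real layout may differ) -- the directivity index of a delay-and-sum beam can be evaluated numerically like this:

    import numpy as np

    C, F = 343.0, 40e3                      # speed of sound (m/s), carrier (Hz)
    K = 2 * np.pi * F / C                   # wavenumber

    def directivity_index(mic_xy, look_dir=np.array([0.0, 0.0, 1.0])):
        """DI in dB of a delay-and-sum beam for a planar array: power toward
        look_dir over the power averaged uniformly over the sphere."""
        pos = np.column_stack([mic_xy, np.zeros(len(mic_xy))])
        # Sample the sphere, keeping sin(theta) solid-angle weights.
        theta = np.linspace(0.0, np.pi, 90)
        phi = np.linspace(0.0, 2 * np.pi, 180, endpoint=False)
        T, P = np.meshgrid(theta, phi, indexing="ij")
        dirs = np.stack([np.sin(T) * np.cos(P),
                         np.sin(T) * np.sin(P),
                         np.cos(T)], axis=-1)
        steer = np.exp(1j * K * (pos @ look_dir))              # matched DAS weights
        # Beam pattern B(u) = |(1/M) sum_m exp(j k r_m . u) conj(steer_m)|.
        B = np.abs(np.exp(1j * K * (dirs @ pos.T)) @ np.conj(steer)) / len(pos)
        avg_power = np.sum(B**2 * np.sin(T)) / np.sum(np.sin(T))
        return 10 * np.log10(1.0 / avg_power)                  # since B(look_dir) = 1

    # Illustrative eight-element square-perimeter layout with 6.5 mm pitch.
    pitch = 6.5e-3
    grid = [(i, j) for i in range(3) for j in range(3) if (i, j) != (1, 1)]
    mics = (np.array(grid, float) - 1.0) * pitch
    print(f"DI = {directivity_index(mics):.1f} dB")

Sweeping the pitch and the candidate geometries (cross, square, circle) and keeping the one with the highest directivity index is exactly the kind of loop described above.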
Okay, so this is the finalized geometry. For loudspeakers, for buzzers, for sources,
unfortunately, we couldn't optimize it properly, and the reason is the physical size of the thing is
just too large, so we have a mechanical reason not to be able to put it as close as we would like
to. So their pitch is 11 millimeters, which is the smallest mechanically allowed pitch. And what
you can see here is that we angle the transducers a bit out. We tilt them out. Again, this 20
degrees here is not an accident. It's an optimized number, somehow the best one, and the reason
is that the transducers themselves are a bit directional. So in order to cover a wider field of view,
it's a good idea to just tilt them a bit out. It turns out that the beams in extreme directions will be
better if you do so.
So this is just a simulated beam from our microphone array. This is just to understand that even if you want to have a point here, a pixel, we don't have it, unfortunately. This is what you would get if you were looking at the beam from the front and just flattened it. So if you have one scatterer, a point scatterer, then instead of seeing a point, you see something like this -- that is the meaning of this image. If you remember what the array looks like, it's essentially three microphones laterally and three microphones vertically. So with a three-microphone array in each direction, you can't do much. We have eight. But if you want to extrapolate the
performance you have in a horizontal plane, you need to square the number of microphones. So,
for example, the performance that you get with eight microphones horizontally you would get
with 64 microphones in the whole space. Okay, so let's go on. Calibration. Calibration is
important. Of course, it's important. We want to have perfectly calibrated things. For example,
here, we have the frequency response of our transducers, actually, of some other transducers, but
all of them are roughly the same. So you see that at 40 kilohertz, they're the most efficient. And
as you move away from 40 kilohertz, they are less and less efficient. So we want to equalize
this, because we assume in beamforming that they have a flat response. So we need to push the
frequencies away from 40 kilohertz. We need to push them up. So things to calibrate are phase,
amplitude and polar response, so we should also measure the directivity patterns of these small
devices if we want to do a proper, complete job. So the way I did this is in the anechoic
chamber, here, and I have a laser here, so I stuck a laser here, and I was pointing this laser at the
microphone to ensure that I'm really looking at the main response axis of these arrays. I was
doing it many times.
Unfortunately, I was only able to properly calibrate the loudspeaker array. There was an issue of
repeatability, and I couldn't figure out why, but especially with microphone arrays, their
directivity pattern is not at all omnidirectional, so small tilts of these things resulted in
completely different gains. So we had to use some sort of overall good gain. Now, enough of
that, so let me just show you the actual correction filters, what they look like for the loudspeaker array, just to show you that they make sense, that they look like something we expected.
So on the left-hand side, you see the correction to the amplitudes that we have to make, so you
see at 40 kilohertz, they're very efficient, so we don't need to do anything. Away from 40
kilohertz, we have to amplify them, and you see that phases are fairly reasonable. Here, you also
see another thing. You see that two of the transducers are very quiet, for whatever reason, so we
really didn't have time to debug this piece of hardware anymore, so we just said, okay, we're
going to amplify them digitally. So you see these two transducers we have to push much harder
than the other ones, in order to perform the same.
So to highlight the importance of proper calibration, I'll just show you an example of a
loudspeaker beam with and without calibration. Just this morning, I realized that what I'm showing here is actually the microphone beam with the loudspeaker correction filters, but the message is the same. Don't worry. So on the left, you have the calibrated beam, which I had already shown you, and on the right, you have an uncalibrated beam. Perhaps this image shows it better. The uncalibrated beam looks weird, and even if it's more beautiful than the one on the left-hand side, we actually want to have the one on the left-hand side. And you can see that these sidelobes are most likely the elements that we have to push harder -- we have to do something with them.
Okay, so now that I'm done with hardware, let's go to algorithms. Okay, so -- first, I want to
explain how, in general, you create ultrasonic images. What do you do? So I already was
talking a bit about B-mode imaging, where you emit sound and then plot reflection intensity
across time, along the time axis. In every mode of ultrasonic imaging, you emit some pulse and
then you measure some parameters of the returned pulse, and depending on which parameters of
the returned pulse you measure, you get different modes of imaging. So B-mode I already
discussed. So in air, when you use ultrasound, B-mode imaging makes limited sense, because
nothing is getting retransmitted. If I get a reflection from that wall, I should not hope to get the
reflection from the wall in another room, the way it works in your body. So it makes much more
sense to do something that we call an intensity image: just find the biggest returned pulse from each direction and plot its intensity against angle, so this will be somehow an intensity image. And then in depth imaging, we're not interested in the amplitude of the pulse. We're actually interested in the timing of the pulse, in some temporal property of the pulse -- when the pulse comes back. That's how we create depth images.
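In code, the difference between the two kinds of pixel is simply which property of the matched-filter output you keep; a minimal sketch with NumPy (illustrative, not the project code):

    import numpy as np

    C = 343.0  # speed of sound, m/s

    def pixel_from_echo(beamformed, pulse, fs):
        """One beamformed receive signal -> (intensity, depth) pixel."""
        # Matched filter: cross-correlate the return with the pulse template.
        mf = np.correlate(beamformed, pulse, mode="full")
        peak = np.argmax(np.abs(mf))
        t_peak = (peak - (len(pulse) - 1)) / fs   # two-way travel time of strongest echo
        intensity = np.abs(mf[peak])              # intensity-image pixel
        depth = C * t_peak / 2.0                  # depth-image pixel (one-way distance)
        return intensity, depth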
So these are somehow different approaches. I'm going to quickly go over them and some problems, and then I'm going to tell you how we solved them. So the naive approach is just a raster scan -- to do exactly what a Lidar does, for example: a raster scan, pixel by pixel, find the time of flight in this direction, find the time of flight in that direction, plot the times of flight and you get your image, right? The problem with this is you have seen the beam, and in reality it's worse. So this is not going to work very well. I'm not going to be getting the time of flight from this direction; I'm actually going to be getting something that's coming from a sidelobe, most of the time. I'm going to come back to this later. So we propose some techniques for deconvolution, and I'll explain how this is different from deconvolving an intensity image and how we can deal with this.
So then I'll propose another method, based on sound source localization, so how we can couple
sound source localization with beamforming to get slightly better images, to get actually usable
depth, quote-unquote, images. And I'll tell you how to improve the frame rate. So frame rate is
one of the most annoying issues, so I'll explain to you how to improve the frame rate from the naive
approach, where the frame rate is completely unacceptable, to something that's completely
acceptable, like 30 frames per second, while exploiting the whole -- every transducer that you
actually have. So, before moving on, I'll just briefly go over the choice of the pulse that we have. This is a topic in its own right. Bats use chirps. We don't use chirps. Bats are much smarter, so they use chirps. We use something that's quite peaky, like a filtered Dirac pulse, so it's just a band-pass pulse between 38 kilohertz and 42 kilohertz that we use. I could have plotted the autocorrelation of this thing, so that you understand what our image looks like when we try to estimate the time delay, but the autocorrelation of this thing is essentially the thing itself, so it was not necessary, right? So you see what the spectrum looks like -- this is just a filtered Dirac -- and here's the formula. It's the difference of two things. Okay, so this pulse turned out to be very good for detecting reflections when they come back, but perhaps you'd want to take some more time to actually design a perfect pulse.
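A minimal way to generate such a band-limited "filtered Dirac" -- written, like the formula on the slide, as the difference of two ideal low-pass (sinc) kernels; the sampling rate, length and window here are illustrative assumptions, not the exact pulse used:

    import numpy as np

    def bandpass_pulse(f_lo=38e3, f_hi=42e3, fs=192e3, duration=2e-3):
        """Band-pass 'filtered Dirac' between f_lo and f_hi."""
        t = np.arange(-duration / 2, duration / 2, 1 / fs)
        # Ideal low-pass impulse response with cutoff f is 2*f*sinc(2*f*t);
        # subtracting two of them keeps only the 38-42 kHz band.
        pulse = 2 * f_hi * np.sinc(2 * f_hi * t) - 2 * f_lo * np.sinc(2 * f_lo * t)
        pulse *= np.hanning(len(t))    # taper so the pulse is time-limited
        return pulse / np.max(np.abs(pulse))

Because the pulse is approximately an ideal band-pass impulse, its autocorrelation has essentially the same shape, which is the property used above.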
Okay, so let me explain a bit about deconvolution and why we have to think of different ways to
do deconvolution than is conventionally done in ultrasonic imaging. So typically, deconvolution
is done for intensity images, and so now imagine that that wall over there is a source of noise, is
a noise generator, and different parts of this wall generate different amounts of noise, and so
imagine that the wall is created by very small pixels of noise, something smallest that you can
detect. Well, then I direct my beamformer towards certain parts of that wall, and I'm not able to
listen to individual pixels. What I'll be listening to when I direct the beamformer in some
direction, I'll be listening to a weighted sum of all these pixels. I was trying to represent this
here, mildly successfully. So you get the weighted sum of this intensity across the beam shape.
So this is actually convolution, some convolution, either spatially variant or spatially invariant
with the beam shape, and this is a linear equation, so deconvolution here amounts to inverting a
linear system. Then you can talk about conditioning. You can talk about whether you really know the
beam shape and things like that, but basically it's inverting a linear system. That's deconvolution
in intensity images, right?
For us, these small beams are supposed to represent these individual pixels, so what is the image creation model that we have in depth imaging? Well, imagine that each small reflector here, each pixel, reflects a part of your sound back, and this reflected sound I call x of theta and t, so for different directions theta, and it's a function of time t. Then, again, this gets convolved with the beam shape, because you're not listening to just a particular pixel; you're actually listening to a weighted sum of pixels, so there's a convolution with the beam shape. Then, what we do is we cross-correlate this with our pulse template in order to find peaks, right? I just wrote it as a convolution with the pulse, which I denote u. And then we actually create the pixel by finding the maximum, by finding the largest peak, okay? So this is the time, the time of the largest peak, and this is our depth. Obviously, this is not a linear model, because there's a maximization inside. We cannot write this out as a matrix multiplication of some true depth image with some beam shape. Our measurements are not linear, okay? So there's no hope to just do very simple deconvolution, as is the case in the linear -- in the intensity images. I hope this is partially clear.
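Written out in rough symbols (with b_i the beam pattern of the i-th look direction, P_i the set of pixels it covers, u the pulse template and c the speed of sound -- notation assumed here, not taken from the slides), the depth-image creation model just described is:

\[
\begin{aligned}
y_i(t) &= \sum_{\theta \in \mathcal{P}_i} b_i(\theta)\, x(\theta, t), \\
z_i(t) &= (y_i \star u)(t), \\
\hat{t}_i &= \arg\max_t \,\lvert z_i(t)\rvert, \qquad \hat{d}_i = \frac{c\,\hat{t}_i}{2},
\end{aligned}
\]

and it is the final arg max that destroys linearity.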
>>: Why not the earliest time?
>> Ivan Dokmanic: The earliest time?
>>: You said the maximum?
>> Ivan Dokmanic: Because the signal -- you will not actually get. So what if you have a
continuous surface? So not discrete distances. Then, you'll get some sort of a smeared pulse.
That's a good idea, and actually I'm using this idea in something else with source localization,
but the thing is that you're not really getting the correct -- if you have a very wide beam, you'll
get some weighted sum of these pulses that will overlap in time, so it's the question of how to
disentangle these pulses. Did this answer your question?
>>: Because --
>> Ivan Dokmanic: Most of the time, you will not get discrete, separated pulses. They will be
overlapping, if the surface is continuous, so I'm somehow trying to give the complete creation
model. For what it is, the first part of the creation model in this case is definitely convolution, so
if you look at certain times, the output of the beamformer is actually some combination of the
outputs of this perfect beamformer, so the reflections that you would be getting from the smallest
pixels. So by X here, I denote the smallest pixels, and for every time you get some combination
of them. P-sub-i here represents all the pixels that contribute to the i-th beam. That's just a
shorthand notation for that. So this, again, is some kind of a convolution. It is a convolution,
right? It's a two-dimensional convolution, and you can write it out as a matrix multiplication,
with some [indiscernible] like matrix A here. So, since this is correct for every time instant, we
can sample the left-hand side and the right-hand side, and we obtain a matrix equation like this
one here. Now, observe that this matrix equation has some vertical structure determined by the
convolution, so vertically, things on the left are some convolution of things on the right with the
beam, but what we don't exploit here quite yet is the fact that this matrix X has horizontal
structure. And this horizontal structure is the structure of the pulse. So let me explain this. If
you emit a pulse and you have just one reflector in your scene, one point reflecting, what you'll
be getting back is the same pulse again, at least in shape. Forget now filtering, but roughly what
will come back is the same waveform, attenuated and delayed, right? So this means that really
every row of X here has to be an attenuated and delayed copy of the pulse, of the pulse that you
emitted. So this is -- we can enforce this in our modeling by representing -- sorry, this should be
D here. D. By representing X as a multiplication of a selection matrix, of a delayed selection
matrix and the pulse dictionary -- so D here is a pulse dictionary. Every row of D is a different
shift of our pulse. And then, matrix B just acts so as to select the proper shift, for each individual
pixel. These ones are just representing nonzeros. B is obviously very sparse, and we know
exactly how sparse it should be. Ideally, it should be having only one nonzero per row, because
each smallest -- each infinitesimal pixel only gives one single reflection. And so the way to obtain the true B, which is then the representation of our depth image, is -- it's like compressed sensing -- by trying to minimize the one norm of the vectorized B matrix, so trying to minimize the number of nonzeros while ensuring that what it yields is not too far from what we measured. And then we can have some non-negativity constraints.
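A sketch of that convex program, assuming the cvxpy package and small illustrative dimensions; Y holds the sampled beamformer outputs, A the beam-mixing matrix and D the dictionary of shifted pulses (this is a sketch of the formulation, not the actual implementation):

    import numpy as np
    import cvxpy as cp

    def sparse_depth_deconvolution(Y, A, D, eps):
        """Recover the sparse selection matrix B from Y ~ A @ B @ D.

        Y : (n_beams, n_samples)   sampled beamformer outputs
        A : (n_beams, n_pixels)    beam-shape mixing matrix
        D : (n_delays, n_samples)  dictionary of delayed copies of the pulse
        Ideally each row of B has a single nonzero, whose column index
        encodes the delay, and hence the depth, of that pixel.
        """
        n_pixels, n_delays = A.shape[1], D.shape[0]
        B = cp.Variable((n_pixels, n_delays), nonneg=True)
        objective = cp.Minimize(cp.sum(cp.abs(B)))            # vectorized one norm
        constraints = [cp.norm(Y - A @ B @ D, "fro") <= eps]  # stay close to the data
        cp.Problem(objective, constraints).solve()
        delay_index = np.argmax(B.value, axis=1)              # depth per pixel
        return B.value, delay_index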
So one thing that you could try doing is, you could try to eliminate the dictionary from this
matrix by multiplying everything from the right by the inverse of the dictionary.
This will generally perform worse, because the conditioning of this dictionary is not very good,
and so this will amplify the noise, and you will get a worse-conditioned problem, but an advantage of this thing
is that it's now separable. Here, B is in the middle of the two matrices. Here, B is sort of alone,
and you can solve this column by column, so it has potential to work faster. This is just briefly
some simulation results. So for this, I only have simulation results. The reason is, we haven't
been able to really measure so far the exact shape of the beam, and this requires it, not perfectly,
but at least to some extent. So this is some artificial depth map that I created, low resolution,
because this thing is, for now, slow, and this is the mask that I was using.
So I convolved this thing with this mask, and I generate temporal signals using some pulse that I
generated, and I add a lot of noise -- a lot of noise -- to this pulse, and then I run the algorithm,
and this is what you get. So you can get deconvolution performance like this one even at 10 dB SNR, so with a lot of noise -- much more than you would be getting in practice from ultrasonic images. This is just to show you that this thing really works, and works surprisingly well, but unfortunately it is slow, and for real problem sizes, it's still too slow for anything real time.
Okay, so next problem is, of course, these sidelobes, and the thing that I was talking about
before, so say you think you're looking there, but your sidelobe is looking towards that wall, and
you think you're getting a reflection from there, but if a reflection from there is 1,000 times
weaker and your sidelobe is only 100 times weaker in this direction, you'll be hearing this wall,
and it's pictured here. So in practice, we don't even have a one-to-100 ratio between the main lobe and
the sidelobes, so this essentially means that you don't know what you're listening to, because of
the sidelobes. There may be nothing there, but you'll get a reflection from here. I hope the
illustration helps you to understand.
So our proposed solution is to couple this with source localization, and to do it in the following way. What I'll do is I'll listen in that direction, and I'll beamform in order to get as directional a thing as I can in this direction, and then I'll find peaks in the beamformed signal, and then I'll go back -- I'll show this again -- I'll go back to the raw signals recorded by the microphones, and I'll feed them into the source localization algorithm and try to find out where these reflections are really coming from. And if where these reflections are really coming from doesn't agree with where I'm looking, I will not put the pixel there. So this is a block diagram of the whole system. I have microphone signals here. I feed them into the beamformer, but also into this generalization of MUSIC that I implemented -- it's a generalization for both azimuth and elevation. MUSIC is a very well-known source localization algorithm.
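A rough sketch of that per-direction gating, reusing the delay_and_sum sketch from earlier and assuming a hypothetical music_doa helper that runs the azimuth/elevation MUSIC on a short snippet of the raw recordings and returns candidate arrival directions (names and thresholds are made up for illustration):

    import numpy as np

    C = 343.0  # speed of sound, m/s

    def gated_depth_pixel(raw, mic_xyz, pulse, look_dir, fs,
                          n_peaks=3, angle_tol_deg=10.0):
        """Depth for one look direction, keeping only echoes whose estimated
        arrival direction agrees with where the beam is pointing."""
        bf = delay_and_sum(raw, mic_xyz, look_dir, fs)      # steered receive signal
        mf = np.abs(np.correlate(bf, pulse, mode="full"))
        peaks = np.argsort(mf)[-n_peaks:]                   # strongest echoes (a real
                                                            # version would pick separated
                                                            # local maxima)
        accepted = []
        for p in peaks:
            t = (p - (len(pulse) - 1)) / fs                 # two-way travel time
            if t <= 0:
                continue
            start = max(int(t * fs) - len(pulse), 0)
            snippet = raw[:, start:start + 2 * len(pulse)]  # raw samples around the echo
            for doa in music_doa(snippet, mic_xyz, fs):     # hypothetical MUSIC helper
                cos_a = np.clip(float(np.dot(doa, look_dir)), -1.0, 1.0)
                if np.degrees(np.arccos(cos_a)) < angle_tol_deg:
                    accepted.append(C * t / 2.0)            # trusted distance candidate
        return min(accepted) if accepted else None          # smallest trusted distance

Returning the smallest accepted distance matches the rule described a bit further on of keeping, per direction, the smallest candidate distance.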
So, for example, I concentrate on some frame. This is one frame. This here, highlighted in red,
is one frame, so it's one direction, and I feed this into the beamformer. This is the output of the
beamformer, and in this output, I find peaks. Here, I find one peak, but actually, I find several
peaks every time. I find, for example, the largest three peaks. Now what I do is, I zoom in to this peak and go back to the raw signals, and I feed these raw signals into the source localizer, and then I try to see if the direction this reflection is coming from agrees with where I think I'm looking. And I do this for many peaks inside every beamformed signal. This is an output of the
source localizer. This is just the score for each direction, for each possible direction, and you can
see that, for example, in this particular frame, there were two sources active. For example, I
could have been standing like this and imaging myself, and I got a reflection from this hand and
this hand together. This makes sense, because they arrived at the same time, but they're
obviously arriving from different directions. Okay. What I really do in practice is, for every direction, I make a list of distances, so I'm not just saying this is the distance in this direction. I actually make a list of candidate distances for every direction, and in the end, I pick the smallest one, because the smallest distance somehow makes physical sense, and it turns out to give good results. These are, just for fun, some
possible outputs, some possible score maps of the sound-source localization algorithm.
Remember, we don't have so many microphones. We have eight microphones for the full azimuth and full elevation, so that's why this is not so peaky. But you can see that sometimes we have a clear source active. Sometimes, we have multiple sources in the same frame, which
completely makes sense. So, okay, I'll show you results for this later. So now I'll move to the
next problem, and the next problem is frame rate. The frame rate suffers because if you want to do raster scanning -- if you want to point the beam at every possible location -- then it's just too slow. Sound travels too slowly and it's going to take too much time. So what people typically do is they say, "Okay, we can do microphone beamforming in the computer. We can do it offline. So we'll just use one source, just splash everything with sound, record it with microphones -- you need a lot of microphones -- and then do the microphone beamforming offline." But the limit on the frame rate is then quite high, because, well, you just need to send one chirp and then only do the receive beamforming.
So now, is there a benefit in using multiple sources of sound as well? I give here an example by Moebus and Zoubir, who do a great job of imaging with 400 microphones, but they use this argument. They say, okay, we're not going to do transmit beamforming, because that's too slow, so we just use one source of sound. So what we propose to solve this problem is something really simple. It's just basic properties of LTI systems. I'm going to tell you a bit about why people probably don't do it too much. So let's go over, again, what the microphones hear. So this is the signal the microphones hear. Let's say that s_i here is the signal emitted by the i-th source, and h_ij is the channel from the i-th source to the j-th microphone. Then, over all the sources, the j-th microphone just hears the sum of these convolutions. That's very simple. Now, for us, every signal emitted by a certain source is a filtered version of some template pulse that we're emitting, so we can write this s_i as a convolution of some filter w_i -- the beamforming filter for this particular beam and this particular source -- with u, the template pulse that we're using. Okay, so we can rewrite this in the frequency domain just as a multiplication.
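In symbols (writing w_i^{(b)} for the transmit filter of source i in beam b; the notation is assumed here and may differ slightly from the slide):

\[
\begin{aligned}
y_j(t) &= \sum_i \bigl(h_{ij} * s_i\bigr)(t), \qquad s_i(t) = \bigl(w_i^{(b)} * u\bigr)(t), \\
Y_j(\omega) &= \sum_i W_i^{(b)}(\omega)\,\underbrace{H_{ij}(\omega)\,U(\omega)}_{R^{\,i}_{j}(\omega)}.
\end{aligned}
\]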
So now look at this. It has two distinct parts. The first part is something that we know for sure. We know the beamforming filters, because we designed them. So this is something we have stored on our device, and there is a part that we have to measure -- we have to put the thing physically into space and then measure -- and it's this part here. It's how the i-th source excites the j-th microphone. But now, if we knew this part here, which I denote with R superscript i down there, if we knew this red part, we could actually create this sum offline. Do you agree? There is no reason why we wouldn't be able to create this sum offline. So what we can actually do is learn the red parts, easily, by firing every transducer individually and recording the R's, and then just do the transmit beamforming offline. You agree with me? It's simple linearity and commutativity.
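A minimal sketch of that offline synthesis, assuming NumPy; R holds the single-source recordings (the "red parts") and W the known transmit filters, with array shapes and names chosen only for illustration:

    import numpy as np

    def synthesize_beams(R, W):
        """Recreate every transmit beam offline from single-source recordings.

        R : (n_sources, n_mics, n_samples) microphone recordings when each
            source is fired alone with the template pulse
        W : (n_beams, n_sources, n_taps)   transmit beamforming filters,
            known by design for every beam and source
        Returns (n_beams, n_mics, n_samples): what the microphones would have
        recorded if each beam had actually been transmitted.
        """
        n_beams = W.shape[0]
        n_sources, n_mics, n_samples = R.shape
        out = np.zeros((n_beams, n_mics, n_samples))
        for b in range(n_beams):
            for i in range(n_sources):
                for j in range(n_mics):
                    # Linearity: beamformed reception = sum over sources of the
                    # beam filter for that source convolved with its recording.
                    out[b, j] += np.convolve(W[b, i], R[i, j])[:n_samples]
        return out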
There's nothing special to it, but it allows us to do only eight chirps -- eight chirps if you have eight sources -- and recreate every beam computationally. And instead of having a frame rate of one-twelfth of a frame per second, we can actually have 30 frames per second for a certain resolution. So why don't people use this much? Of course, people have noticed it, but there are probably two reasons.
First, for lasers, there's no thinking about it. Lasers are themselves very directional, and you just use raster scanning. In loudspeaker beamforming, usually, people want to listen to sound, so they
reproduce sound for people to listen to, and you can't use this, you can't have a loudspeaker array
of 20 loudspeakers and then play each one of them individually and then say to someone, okay,
now do the beamforming mentally, in your head. This is not good. You want to create a sound
picture, but here we don't care. We're creating an image. It doesn't have anything to do with
listening to actual sound, so we can do this. This is just the block diagram.
So what we do is we play a delayed version of the pulse through every loudspeaker, filtered by
corresponding correction filters, store whatever is recorded by microphones. In our case, this is
64 signals, and then we create everything from these 64 signals. This actually came out from a
discussion with an aide who suggested this for deconvolution. This can actually also be used for
deconvolution. I'll talk about it in future work. So these are some results. Brace yourself, and
then some future work that I think could be done in order to really, really improve what I'm
having. So this is the experimental setup at some time of the night in the atrium of our building
here. Cleaning ladies were not very happy that I was preventing them from coming here. I was
doing the experiment in the atrium because, with the current design, the beam is still too wide and reverberation doesn't do me any favors when I'm in a room. When I have a lot of reflecting objects in the room, the images are worse. So that's why I went to the atrium, in order to have a smaller number of reflections -- I mean, the reflections come much later.
Okay, so results. I was imaging myself because of this -- I didn't mention something: somehow, the underlying goal of all this is skeletal tracking. What we're trying to do with this is
we're trying to see whether we can design -- it's an exploratory project, whether we can design a
skeletal tracker using ultrasound instead of light. So I was standing like this in the atrium with
my arms folded next to my body, and this is the image that you get. So, first thing, remember what some people get with 400 microphones, and they do a really good job. But you understand that
this is actually not bad. It's not bad, because you can actually see that there is a body here,
beyond all questions of how much my body reflects ultrasound, beyond the fact that not every
reflection here is specular, but you can actually see something. So now this what I'm showing
you, this is an intensity image, which means that in every direction, you just calculate the
intensity. What we want to have are depth images. So a naive way to create a depth image is to
just, as I said, point the beam there and measure the time delay. I said this is not going to work
because of the sidelobes, and indeed, this is the depth image that you get with this naive
approach. I'm showing this to you just for the drama. So you see everything is at two meters,
almost everything, because every time -- I'm standing there, right. And every time some part of
the beam, some part of the sidelobes, will catch my body and will create a reflection, so I'll think
that everything is at two meters. After applying the sound source localizer to this, we get a much
more reasonable thing. So you see that now, only the pixels that I can trust are shown, and these
pixels are at correct distances, and they're sort of aligned with my body. Of course, this can be
improved further with further work, but this is a pretty large improvement from this. So now
let's see a more interesting thing, and this is kind of cool. So what happens if I spread my arms? Arms are quite small, and they're not very good reflectors -- specularity and whatnot. So if I spread my arms, you can actually see it in the intensity image, which is kind of -- I have to say, maybe four weeks ago I lost hope for this, but now I'm really happy that I got this image. This means that you can see where the arms are, so imagine having many more microphones,
imagine having better algorithms. You could probably do very well. So you can actually see
where the arms are. Again, just for comparison, this is with arms folded and this is with arms
spread, so you see arms here and here. That image, again, naive approach, everything is at
around two meters. Not very good. If you apply sound source localization now, it's good -- we get something like this, so you get this alien-like creature which, with some imagination, is me in that picture, but this actually shows something, some body and some arms spread, so it's much
better than this. And it's actually pretty good, because the distances are correct and the arms are
in the correct positions. The reason why this works is that, on the body, you have many kinks, parts, small details that actually do reflect sound in different directions, so you can hope to find
something around here that will reflect sound back to your device.
And what I'm showing here is the same thing, but with this fast-frame, single-shot acquisition,
with linearity, again, with my arms folded. This was a different experiment, so this is not the
correct photo. I should have put on a different photo from that experiment, but there were no
people in the atrium that could take a photo of me at that time, so I had to give up. But you see
that the image is somewhat worse. I would expect the image to be exactly the same as the other
one. Unfortunately, for some reasons that I haven't been able to debug yet, the image is slightly
worse, but still it's sort of correct, right? And again, everything is at -- I was standing a bit
closer, at a bit less than two meters. In the naive approach, where I sort of recreate the beams, if
I apply sound source localization, then you sort of see the head, which is a strong reflector, and
you see arms. This requires further work. It requires -- it's not quite clear to me why this didn't
work as expected, but in theory it should, so this is probably just a bug.
So these are the results, and I'll go over some of the conclusions. So what did I have to do? I
had to design this hardware, with a lot of help from Jason. I had to -- this involved a lot of
things, like how to interface it to a PC, definitely how to design the geometry. So I explored
various image creation algorithms, and the images that we have, they have three-degree angular
resolution. You could create finer images, but with current beam shape, it doesn't make much
sense, right? They wouldn't have much physical meaning, unless you were able to properly
deconvolve them. So three-degree angular resolution is somehow a qualitative, good number of
pixels, so I propose some convex optimization based and some source localization based
algorithms to improve the depth images, to improve the actual depth estimation, and so one of
the major problems, the speed of sound, the proposed solution. The experiment, it still needs
some work, but definitely, the solution is to use multiple transducers, but to use these simple
properties of LTI systems that I can do things offline, even in transmitting. And so this takes us
from one-twelfth of a frame per second for the naive approach to 30 frames per second for eight
transducers for this approach, with transmit beamforming online. And the good thing about this,
the great thing about this, is that this doesn't scale, right? The time, if you fire only eight
transducers and get more sophisticated algorithms, you'll still need exactly one-thirtieth of a
second to acquire the frame, because you only need to fire eight transducers, so it doesn't scale.
See what I'm trying to say? So you can have very sophisticated algorithms after this, but the image acquisition is going to be very fast, so the sky is the limit.
The major advantage of this thing is low power consumption. Again, things operating with lasers use a lot of power. It does depend on the number of pixels, but typically they use a lot of power. In this regard, our prototype uses three watts. This is a lot, and this is because of the nature of the operational amplifiers that we're using; the currents are quite high. But if you do the design specifically for these transducers, you could achieve something like one watt during the pulse for 48 transducers at once, and with perhaps a 10% duty cycle -- you're not using them all the time -- this results in maybe 100 milliwatts of power consumption. Then, if you go back here and
you do a single-shot acquisition, when only one transducer is active at a time, you could actually
perhaps divide this by five, so get something like 20 milliwatts, which is a huge difference if you
compare it with what you need for lasers. Microphones virtually don't need any power.
Okay, and so the question, which I didn't highlight enough at the beginning of the talk, is whether ultrasound can be used for skeletal tracking. Well, I think that probably the answer is yes. If you use a higher frequency, you play with the algorithms and you have proper calibration, I'm pretty sure that you can extract enough meaningful features about the positions of limbs in order to create a skeletal tracker. So let me go just quickly over some possible future directions. Definitely, there is value in the B-mode or intensity images. Perhaps we could use these images to detect connected regions or something like that, and then use them to enhance the depth images. So there would be value in combining the two kinds of information, to build a reasonable skeletal tracker.
And then, for the actual skeletal tracking, well, what is the information that we use? The
information that we use is the 64 signals that we recorded. So perhaps the important features are the times and intensities in every temporal bin of these 64 signals. That's what matters. So what you could do is divide these 64 signals into bins, compute the average intensity over every bin and feed these as features for the training of the skeletal tracker, right? You see what I'm trying to say. So you would not create the depth image at all. You could completely bypass the depth image and just get the skeletal tracking from these features, which directly encode,
actually, the positions of objects in the scene. Okay, we can also formulate the deconvolution that I was proposing directly in the signal domain, never creating beams -- so we can formulate this deconvolution starting from the eight-by-eight signals that we have, and similarly for the other thing. And of course, the most mundane idea that we have is to increase the number of microphones. So, nowadays, these guys -- Przybyla, one of the papers that I showed in one of the first slides -- use something called capacitive micromachined transducers, and these things you can fit -- for example, they fit 37 microphones on a PCB 6.5 by 6.5 millimeters. It's quite fascinating.
The problem currently is that they work at slightly higher frequencies, so their range is limited,
but I think in the very foreseeable future, this is going to be the technology of choice for
ultrasonic imaging in air. So just as an example, what could you do if you had 64 microphones?
So this would be the beam shape, right, and this is the beam shape that we have currently. So,
obviously, virtually no sidelobes. It's still not a point. It's still some piece of a beam, but it's
much better than what we currently have. So many of the problems will be mitigated with
having more microphones.
Okay, well, that's roughly everything that I had to say. Thanks for listening. If you have any questions, please go ahead.
>>: I would like to ask some questions. You're trying to take one picture using the ultrasound
and turn it into skeletons, right.
>> Ivan Dokmanic: For example, yes, but you'd do it consecutively.
>>: And how many frames did you use to create those?
>> Ivan Dokmanic: These are all single frames.
>>: But do you think it will get better if you use multiple frames?
>> Ivan Dokmanic: I would say yes. No is obviously the wrong answer -- it can't get worse, so it will get better, because you will have some modeling over time, so you could probably use this also to get higher resolution.
>>: And it just occurred to me. You're using a single pulse, right? A single-frequency pulse, like --
>> Ivan Dokmanic: We use a band, like from 38 kilohertz to 42.
>>: Yes, but do you think it's possible to shift that -- like make the beam a little smaller and shift
it into multiple frames, so for instance, on the first frame, you shoot from 38 kilohertz to 39
kilohertz, on the second frame you shoot 39 to 40, something like that.
>> Ivan Dokmanic: An OFDM kind of --
>>: You get a similar chirp-like signal, but it's not exactly a chirp, but it's like a beep.
>> Ivan Dokmanic: Yes, definitely. We were thinking about things like that. One thing that
you could do for sure is use multiple frequencies, not like 38, 39, but try to use 40 kilohertz and
maybe 80 kilohertz, or 40 and 73 -- I don't know. And then this could be really useful to remove,
for example, coherence speckle or some effects that are the artifact of the actual frequency that
we're using. So things that you're suggesting would be great, also, for improving the frame rate
further, in the following sense. If you could use this kind of OFDM idea, if you could use
separate bands, you could actually emit one pulse -- so one beam in one part of the spectrum,
other beam in other parts of the spectrum and other beams in the third part of the spectrum, all at
the same time. So that's a good idea, but it requires a lot more design, and you need a slightly wider band. I had a lot of trouble with this thing, with the repeatability of these transducers. Sometimes they don't work as expected, so you need a more reliable source of ultrasound that's more wideband, because these things are very narrowband. It's 40 kilohertz and a bit to either side, so we need to push things a lot, right? Maybe you'd like to have something that has a much wider bandwidth, and these things exist.
>>: More questions?
>>: Two questions, actually. First, in terms of what [indiscernible], I'm most interested in a
dynamic interacting [indiscernible], so can you exploit -- can you exploit the difference from
frame to frame? So you issued that as one frame, and so I don't even care what's here. When I
do that lag, it's like, what's different now? These are the parts, and then try to figure out which
parts of that are different. That's the part that's using --
>> Ivan Dokmanic: Definitely, yes. I think that's a good idea. Actually, now that you're saying this, it kind of reminds me of -- maybe someone is familiar with the phase vocoder or something like this. You're trying to determine a real frequency. You have an FFT of something, and you're trying to determine the real frequency, right? So the way to do it is to take two consecutive frames, see the phase change, and interpolate the frequency. And then you're not limited to the bin anymore. You can get the real frequency. So, yes, I think this could be
really useful, because also an intensity image could be used for that, right, because some
intensities would slightly go down, some intensities would slightly go up, so you could probably
interpolate, play on this, to get a much better image.
>>: And the second question is [indiscernible] position of that matrix, of getting the various
functions. So one of your hypotheses was there was only one single [indiscernible] from
uncertainty [indiscernible], but due to multipath that's actually not true, right?
>> Ivan Dokmanic: Well, if you really have a really narrow beam, so that you can really listen to this particular pixel, then what has to happen is that you have a multipath component that comes from the exact same direction, and this is very unlikely to happen -- well, because it's probability zero. In the limit case, where you have infinitesimal pixels, if you're listening exactly in this direction, then to also have a multipath component coming from exactly this direction is probability zero. A lot of random things would have to align in order to have a couple of reflections and then suddenly a reflection again from there.
>>: But the probability, isn't it the same that something comes back to that?
>> Ivan Dokmanic: It is, but infinitesimal pixels for me are -- okay, perhaps I didn't understand
your question, and I see what you're saying. So, for example, if I have a mirror, how will I know
that it's a mirror, that it's not something over there. That's a good question, and the answer is,
without further information, there's no way to tell it. So if you're just using this information,
there's no way to tell. You should be able to estimate the positions of such big reflectors. One
way to do it is -- it would probably be too slow, but since this is all modeled by image sources,
actually, one way to do it would be to try to combine pairs of pixels to get a third pixel. How to
say this? So this reflection from that wall will also have a corresponding object there, so they
have a clear geometrical relationship, so the object that creates the reflection will also be
hopefully inside the scene. So since this object, you have seen this object and you have seen its
reflection, but maybe you know that there is a wall here. There's this geometrical relationship
between the reflection and the real object, and you can detect the reflection by exploiting this
image source relationship. So you can say, okay, ah, this guy is actually a reflection of this guy.
I'll just burn it out. That would work, but it would probably be too slow, because you need to
combine pairs of these. But that's one idea on how to remove this thing.
>>: Other questions?
>>: The problem of the sidelobes picking up reflections where those are [indiscernible]. Did
you consider treating it as a filter design problem, whereby, at the expense of increasing the size of the main lobe, you could suppress the sidelobes and reduce the probability of a reflection falling on a sidelobe, but at the same time reducing the resolution because of the
[indiscernible].
>> Ivan Dokmanic: Actually, I had it -- that was initially one of the ideas, to sort of design the
beam pattern, then, later, we didn't really do it. The reasons are we hope that the sidelobes are
not going to be so loud, and then later I just forgot about it. Then, the main lobe is still quite
wide, and if we did it like that, then it would be even wider, considerably wider, probably. So it
would be beneficial -- okay, so the main problem with sidelobes is with specular reflection. So
I'm trying to get something from there, and say that I'm hoping -- I don't know why, but I'm hoping -- for a small, diffuse reflection from something over there, and here I'm getting a strong reflection. So this diffuse reflection is, for example, 1,000 times weaker than a strong specular
reflection from there. And I doubt that, especially with eight microphones, you can create a
beamformer that will suppress everything but the main lobe more than 1,000 times. I'm sure that
you can't.
You see what I'm saying? The problem is that if there is a strong reflector in the scene, then even if you're shooting just a bit of sound or listening just slightly in its direction, you'll
hear everything coming from this strong reflector. One reflector in the scene has a reflectivity
that's 1,000 times stronger than the other things in the scene, so maybe some fabric and a piece
of wood. I'm just talking gibberish, but then everything would seem to be coming from this
piece of wood. I'm not sure if I'm clear.
>>: If you were to practically deploy it, say, in a living room, where there are reflections that
you expect always to be there, you wouldn't expect to have [indiscernible], could you have a
calibration procedure that determined where they're likely to be, make sure that the nulls of the
sidelobes are pointing in the directions where you actually --
>> Ivan Dokmanic: Definitely, that's a great idea. Yes, I didn't think of that, and this seems like
a really practical and good idea. Why not? So we just sort of move out of the way, calibrate,
and then go and use it. This is actually good, especially with more microphones and more
transducers. You could certainly play on that.
>>: In some senses, that's a bit like MVDR, in the sense that if you walk out, then your noise is
your unwanted signal, which might be reflected. You first try to minimize that such that when
you actually come back in, then the only thing that's in the room is the wanted signal, and that's
going to be you.
>> Ivan Dokmanic: I like this idea. I think this could be really good.
>> Ivan Tashev: More questions?
>>: Can you increase the number of microphones without increasing the number of transducers?
>> Ivan Dokmanic: Of course. That would be something that you could do. So there is benefit
in having more than one transducer, and the benefit is the following. You can sort of partially
isolate regions of space, which is good for sound source localization. I can elaborate on that
later, but definitely, what you would do is you would not increase the number of transducers too much, but you would increase the number of microphones a lot.
>> Ivan Tashev: More questions? Okay, let's give thanks to our speaker today.