Sing Bing Kang: It's a pleasure for me to introduce Mohit Gupta. Mohit graduated with a Ph.D. in robotics from CMU. And he's currently at Columbia, where he's a research scientist, and he will be an assistant professor at the University of Wisconsin-Madison starting January of next year. So Mohit is an expert in computational imaging. More specifically, for example, in computational cameras, where he's interested in coming up with systems that are robust under very challenging situations. Mohit just told me that some of his work in structured light has been licensed by a Japanese company. So that's great. So Mohit.
>> Mohit Gupta: Thanks. Thanks Sing Bing for the introduction. And thanks
a lot for inviting me. I'm going to talk about 3D cameras. Now, 3D cameras
have a lot of history. This is perhaps the first known 3D imaging system. It's called photo-sculpture and was invented by a French sculptor called François Willème. This consisted of a large room, something like this, and the
subject who needed to be scanned would sit in the center of the room. There
were a lot of cameras on the walls of the room. And each of these cameras
would take an image, which were then manually put together to create a
physical replica of the subject. Now you can imagine this is not a very
portable system. This is not very fast. And this did not get very high
resolution 3D shape. We have come a long way since then. Current 3D cameras
can capture very high resolution 3D shape in near realtime. 3D imaging techniques, at least most of them, can be broadly classified into two categories. First are the triangulation based systems like stereo or structured light. This is a typical structured light system. It consists of a camera and a projector. The projector projects a known structured illumination pattern on the scene. The camera captures an image. And because we know the pattern, we can find the correspondences between camera pixels and projector pixels, which we can then use to triangulate and get the scene depths. For those of you who know stereo, this is basically a stereo system, but instead of two cameras, we have one camera and a light
source. The second category is time of flight. Again, a time of flight
system consists of a light source and a camera. Now, in the simplest form,
the light source emits a very short light pulse into the scene. The pulse
then travels to the scene and then comes back to the sensor. There's a
high-speed stopwatch that measures the time delay between when the pulse was
emitted and when it was received. Now, because we know the speed of light,
we can compute the depth or the distance by this very simple equation here.
There's a factor of two here because light travels the distance twice, back and
forth. Now, this is one classification, but another way to classify 3D
cameras is active versus passive. For the purpose of this talk, we're going
to focus on active cameras which use a coded light source to illuminate the
scene. So both structured light and time of flight are active imaging
techniques, and they currently form the basis of a large fraction of 3D cameras, especially in the consumer
domain. You guys all know about the Kinect. Now soon we will have these
devices in our cell phones and in our laptops. Because of their low cost and
small size, these cameras really have the potential to be the driving force
behind a lot of new imaging applications. So think about augmented reality,
robotic surgery, autonomous cars, industrial automation. Again, so there's a
lot of potential. But in all these scenarios, these cameras will have to
really step out of the laboratory, out of their comfort zone, into the
real world where they face a lot of challenges. So here is a 3D camera. It
has its source and its sensor. Now, it is okay if this light source is the
only one that is illuminating the scene. But in outdoor scenarios,
especially for autonomous cars, this light source has to compete with
sunlight, which is much, much brighter. So here's an example. This is an
example image and the 3D image captured by a time of flight camera. This is just around dawn, around 6:00 a.m., so the sunlight is not very strong. Now, I created a time lapse video by capturing an image every ten minutes. So see what happens as the sun rises. Wherever there is sunlight, the camera cannot measure 3D shape. Another major challenge is if there is some
kind of scattering media present between the camera and the scene. Again,
this scenario arises a lot in autonomous cars or even in medical robotics when
there is stuff like fog, smoke, or medical -- or body tissue. So here's an
example. This is the video of a self-driving car. First, this is on a clear
day. Now, the car approaches a turn and it's nice and clear, so it makes the turn successfully. Now, here is the same car, the very next day, when there is fog and rain. So watch what happens as the car approaches the turn. It cannot make it. It goes into the curb. So this is clearly not acceptable. We
would like to avoid such a thing. Now, these challenges are due to
environmental factors. These are something which are external to the scene.
But even if we are in a controlled environment, let's say indoors, maybe on a
factory floor, there are challenges due to the scene itself. For example,
many of these cameras assume that light bounces only once between when it
leaves the source and gets to the sensor. Now, this is fine if the scene is
perfectly convex. But in general, light will bounce multiple times. So
think about this room or think about the robot which is exploring an
underground cave. Light is going to bounce multiple times. And finally, if
the scene is not made of nice well-behaved diffused materials but something
more challenging like metal, glass, or something translucent like skin,
again, these, cameras cannot measure shape reliably. So if you look at the
evolution of 3D cameras over the last 150 years, we have made a lot of
progress in terms of speed, size, and cost, but I believe that the next
generation of 3D cameras, if they have to make a really meaningful impact in
our life, they'll have to perform reliably in the wild. And by in the wild,
I mean in every environment and for every scene. So now, I'm going to talk
about some of our work on dealing with these core challenges. And I'll
start with ambient illumination or sunlight. Now, the first idea, the very
first thing that will come to your mind for dealing with ambient illumination
is why don't we capture two images? One where the light source is on so that
the total light is the sum of I-source and I-sun. And then we capture
another image where the light source is turned off or blocked so that the
camera captures only the sunlight. And then by simple subtraction, we can
remove sunlight. Now, it would be nice if this simple approach works, but
unfortunately, it does not. The previous example that I showed was after the
subtraction. And the reason why this approach does not work is because
there's a more fundamental effect at play here because of the particle nature of light. Now, when a camera measures light intensity, it does not measure it as a continuous wave but as a stream of discrete particles called photons. And the arrival of photons is not uniform. It's a discrete random process, which is modeled as a Poisson process, which means that if the mean rate of arrival is, let's say, three photons per unit time, the camera may measure three photons, it may measure six photons, or it may measure zero photons. So there's an inherent uncertainty associated with any light measurement. We express this uncertainty by using this random variable called photon noise. So the actual measured value is the sum of the true value and this random variable, photon noise. Now, this is a discrete process so it follows a Poisson distribution. But the important thing that I want you to remember is that the standard deviation of this photon noise is equal to the square root of the mean value. So the more light is captured, the more the uncertainty. Yes?
>> This is more of a problem if your [indiscernible] time is very short,
right?
>> Mohit Gupta: This is more of a problem if your [indiscernible] time is
too short, but it's also a problem if the light that you're sending out from
your light source, which is carrying useful information about the scene, if
that is much smaller as compared to ambient light because ambient light is
also going to contribute this photon noise, but it's not carrying any useful
information.
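To make the photon noise argument concrete: a minimal Python sketch, assuming ideal Poisson statistics and ignoring read noise, quantization and the other implementation issues just mentioned; the photon counts are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def snr_after_subtraction(i_source, i_ambient, n_trials=100_000):
        # One frame with the source on, one with it off, both Poisson-noisy.
        on = rng.poisson(i_source + i_ambient, n_trials)
        off = rng.poisson(i_ambient, n_trials)
        signal = on - off                      # subtraction removes the ambient mean...
        return signal.mean() / signal.std()    # ...but not the ambient photon noise

    print(snr_after_subtraction(100, 100))      # source comparable to ambient: SNR ~ 5.8
    print(snr_after_subtraction(100, 100_000))  # ambient 1000x stronger: SNR ~ 0.2

The subtracted image is unbiased on average, which is why the two-image idea looks reasonable; it fails because the variance is dominated by the ambient term.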
>> But photon noise is an unavoidable limit, but then you also have
implementation issues like, you know, the amount of charge that a bucket can
hold, so the dynamic range, readout noise, there's a whole bunch of things.
>> Mohit Gupta: That's exactly true. Now, the reason I'm focusing on photon
noise is because, as you said, this is a very fundamental property of light,
and no matter how good your camera is, even the human eye, you cannot avoid
photon noise. So let's see how photon noise affects 3D imaging. So now, you
have this camera and the scene is also illuminated by sunlight. Now, the total measured light has three components. One is the source light, then there is the sunlight, and then there is the photon noise. Now, we call the source light the signal component because this is the only one which is carrying useful information about scene depths. Now, as we said before, the standard deviation of the photon noise is equal to the square root of I-source plus I-sun. Now, we are looking at scenarios where sunlight is much stronger, so we can approximate this as the square root of I-sun. Now, the quality of 3D shape that this camera measures is given by the ratio of the signal component and the noise. We call this the signal to noise ratio, and in this case, this is I-source divided by the square root of I-sun. Now, immediately here, we can see the problem. If sunlight is much stronger than source light, then you're not going to be able to measure 3D shape reliably. So how does this --
>> Are you going to say something about wavelength effects? You know, using narrow band filters?
>> Mohit Gupta: So we can increase this ratio by using wavelength filters.
But as I'm going to show, that increases this term by a factor of about one
order of magnitude. Now, in practice, we are looking at a gap of about 3 to
5 orders of magnitude. So even if we do this filtering, and in all our
experiments, we do that, but still we need to get something more. But that's
a very good question. So yeah. Let's look at what this ratio looks like in
practice. So here is a plot of sunlight strength measured over a typical
day. So X axis is the time of the day and Y axis is the ambient illumination
strength. This is a log scale. And these are the images of the sky at different times of the day. And this is the range of strength of typical artificial light sources. If you have a very bright spotlight, you're somewhere around here. But if you have a weaker one,
maybe a pocket projector, you're somewhere around here. So in all these
scenarios, the artificial source is about 2 to 5 orders of magnitude weaker.
And that's what makes the problem so challenging. So how do we deal with
this? So now that we have established that the problem is photon noise,
which is a random variable with a high standard deviation, the first idea is
to just capture a lot of images and compute the average. Now, we know from statistics that the standard deviation of the mean of a bunch of random variables is lower than the individual standard deviation. So in particular, if you capture ten images and take the average, the standard deviation of the average is going to be lower by a factor of square root of ten, which means the SNR is going to be higher by a factor of square root of ten. So if you plot the SNR as a function of time, we get -- there is growth, but there's a very slow square-root growth. Meaning, suppose we want to increase the SNR by a factor of ten, we have to capture a hundred times more images. And that's not really
practical. Especially in outdoor scenarios where we have to make decisions
in realtime. Now, the problem with this approach is that you're starting
with a light source that has a small amount of power, and then you're
diluting it even further by spreading it out over the scene. And then you
have to compensate for that by doing this averaging in post processing. So what if we could increase the signal strength before the images are captured? So instead of spreading out the light source power over the scene, what if we could divide the scene into N parts and focus all the available light into one part at a time? We are still capturing N times more images, the same as the averaging approach. But now the SNR increases by a factor of N in the same total time. So in the same total time, there's a much faster, linear growth. So the next question is, if focusing light is good, then why don't we concentrate all the available light into a single scene point at a time? And that is the approach taken by many commercial LIDAR systems. It works in that the signal strength is very
high, but it's a very slow process. You have to scan over the scene one
point at a time. Now, the other extreme is of course the averaging approach,
which we talked about. That's also very slow. Now, our main observation,
our main contribution here, was to identify that these are just two extremes in the space of light spreads. There is this whole unexplored space in the middle here, and the optimal light spread must lie somewhere in between. Now, intuitively, the spread should be small enough that we don't have to do too much averaging, and it should be large enough that we don't have to do too much scanning. So we've derived a closed-form expression for the optimal light spread. It depends on the light source and
sensor characteristics. But the most interesting thing here is that the
optimal spread actually depends on the amount of sunlight. Which means that
if we have different scenarios, let's say, sunny day, cloudy day or night,
the optimal light spread is going to be different. Decreasing sunlight,
increasing light spread. Now, this is an interesting design. We are saying
that the optimal camera for an outdoor scenario is the one that really adapts
to the amount of sunlight. Now, this is fine [indiscernible] but then a
practitioner might ask how do you implement such a light source? Now,
remember that we don't -- we cannot do this by blocking light. We don't want
to lose light. So we really have to change the spread of light reactively to
the environment. Now, if you look at this laser pointer, this is focusing
all the light into a single point. If you look at this projector here, this
is spreading out all the light. So these are fixed light spreads. How do
you implement this light source with variable light spreads? We solve this
optical problem by making this mechanical apparatus. Here's a polygonal
mirror and there's a laser diode. There's a cylindrical lens. And as the
mirror rotates, this laser sheet is swept across the scene. This apparatus
is typically used in laser scanners, and typically this mirror rotates very
fast so that in the time that it takes for a camera to capture an image, the
sheet has swept across the scene. Now, our main idea here was that if you
could somehow control the rotation speed of the mirror, then we could
achieve -- we could emulate a light source with different amounts of light
spreads without losing any light. So for example, if the mirror rotates slowly, then this sheet sweeps through a smaller area, and each point receives a lot of light. If the mirror rotates fast, then the same sheet sweeps over a larger area, but each point receives a small amount of light. Now we're
not losing any light here. So these are some example images captured by our
system. Here the scene was just a single flat white wall. From left to
right, the rotation speed decreases. And if we plot the intensities along this horizontal scan line, we can see that from left to right, the spread decreases but the peak intensity increases. The area under the curve for all these is the same, which is just the total amount of light going into the scene. Now I
want to show some real experimental results with the setup. So here's an
example object placed outdoors. Very strong sunlight. Around 75,000 lux.
In comparison, the total brightness of the light source was close to 50 lux. So again, maybe three to four orders of magnitude difference. And Rick, to answer your question, we used a spectral filter to block out some of the ambient light, but still we needed something more. This is the result that we get with the frame
averaging approach. We see a lot of holes. It's incomplete. This is the
result that we get with the point scanning approach. Now, I want to explain
this result because we used a fixed time budget for all the methods. So
because we have a small -- we're capturing a small number of images, we have
to subsample the scene with the point scanning approach, and that's why we
see this blocky result. All the surface details are lost. With the same
number of images, and with the same power, this is the result that we get
with our method.
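The tradeoff just described can also be checked numerically. This is a rough sketch assuming simple Poisson statistics and a fixed power budget; the numbers are made up and are not from the actual prototype. Averaging N frames of the fully spread source improves SNR roughly as the square root of N, while concentrating the full power into one of N scene blocks per frame improves it roughly linearly in N over the same total time.

    import numpy as np

    I_AMBIENT = 10_000.0      # ambient photons per pixel per frame (illustrative)
    I_SOURCE = 100.0          # source photons per pixel per frame when fully spread
    N = 64                    # number of frames, and number of scene blocks

    def snr(signal, noise_variance):
        return signal / np.sqrt(noise_variance)

    # Frame averaging: every frame sees the weak, spread-out source;
    # averaging N frames divides the noise variance by N.
    snr_averaging = snr(I_SOURCE, (I_SOURCE + I_AMBIENT) / N)

    # Light concentration: each block is lit in only one frame,
    # but in that frame it receives N times the source power.
    snr_concentration = snr(N * I_SOURCE, N * I_SOURCE + I_AMBIENT)

    print(f"frame averaging:     SNR ~ {snr_averaging:.1f}")      # grows like sqrt(N)
    print(f"light concentration: SNR ~ {snr_concentration:.1f}")  # grows roughly like N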
>> What happens if the scene contains like [indiscernible] not very dark?
>> Mohit Gupta: Yeah. That's something which we're thinking about right
now. Right now we assume that the entire scene is lit with the same amount
of ambient illumination. And if there is a very strong shadow edge, let's
say, then ideally you would like to do something different for different parts of the
scene. And we don't have it right now, but that's a very good future
direction.
>> You're using a rotating mirror. But could you use a MEMS device to deflect things?
>> Mohit Gupta: You could do that too.
>> And then the cylindrical part which spreads vertically, could that be a
deflector as well?
>> Mohit Gupta: You could have it be a deflector, like [indiscernible].
And that will probably give you more flexibility. This was kind of the
simplest implementation, and here, we didn't -- this device that we built
actually looks like an exotic device, but it's available off the shelf. The
only thing that we had to do was change the speed.
>> So [indiscernible].
>> Mohit Gupta: In this case, I believe we used about 30 images in total.
So the acquisition time was close to one second. Yeah. So now that we have
this device, we went around Columbia campus scanning all those structures.
So this is a marble statue around noon. Again, very strong sunlight. We can
still recover very fine details here. This is a church inside Columbia
campus. Again, strong sunlight, and we can still recover a high level of detail.
Now, getting these kinds of fine details can be useful for [indiscernible] navigation, but also in virtual reality and digital tourism applications where we want to acquire highly detailed 3D structures of large scale buildings and cities. So just to summarize this part, we looked at the effect of photon noise on 3D imaging and we showed that the optimal thing to do is to adapt the spread of light according to the environment, and that this should be done before the images are captured. So next I want to talk
about the problems of scattering and dealing with difficult geometry. And
the underlying physical processes are very similar. So I'm going to talk
about them together. I'll start with geometry, actually. So suppose you're
an architect and you want to capture the 3D shape of this room here. And suppose
you are using a time of flight camera. Now, you will be okay if there is
only this direct reflection. But now because this is an enclosed scene,
there will be all these indirect light paths which are called
interreflections. Now, these indirect paths are longer than the direct path.
So because of that, the camera will overestimate the scene depths. So here's
a simulation. This is a camera view. This is the ground truth shape along
this horizontal scan line. It's a square. This is the shape recovered using a conventional time of flight camera. There are two things to notice here. One, the errors are quite large. The scale of the scene is about three meters but the error is about one meter, which is not acceptable in many settings, in road navigation, in modeling, et cetera. The second thing to
notice is that this is not a random [indiscernible] kind of error. It's a
very structured error which cannot be removed in post processing by simple
filtering. So you have to do something again before the images are captured.
So how do we deal with this problem? First let's look at the image formation
modeling in a bit more detail. Now, this is a time of flight camera but we
are going to use the continuous model of time of flight where the light
source does not emit a short light pulse but emits light continuously. And the
intensity of the source is modulated over time. So for example, it could be
a sinusoid over time. So this is a light source intensity on Y axis. X axis
is the time. Now, this light ray travels from the source to the scene and
then comes back to the sensor. Now, the light received at the sensor is also a sinusoid over time, but it's shifted. The amount of this phase shift is proportional to the travel distance. And we can
measure the phase shift and hence we can measure the travel distance. That's
how continuous wave time of flight cameras work. Now, suppose the scene is a
single flat plane. Each camera pixel receives a single light path and there will be a sinusoid corresponding to that. But if we make the scene a bit more interesting, a bit more realistic, this camera pixel starts receiving light along this indirect path as well. Now, this path has a different length, so the sinusoid corresponding to it will have a different phase. There's another light path with a corresponding sinusoid with a different phase. Now, eventually the camera integrates all these rays, which means all these sinusoids are going to get added together, and we will get another sinusoid which now has a different phase than what it should be. And this phase error is what results in the depth errors. Now, interreflections are an important
problem in vision. They have been looked at in many different contexts over the last 30, 40 years, in photometric stereo and structured light. They've received a lot of attention recently in the time of flight community as well. But most existing approaches assume that the indirect light coming to the camera comes along 2 or 3, a small discrete number of, indirect light paths. But if you think about it, in general there can be an infinite number, a continuum, of light paths. And each of these paths is contributing a sinusoid
here.
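The sinusoid picture can be made concrete with a toy numerical model (hand-picked amplitudes and path lengths, not the actual camera pipeline): each light path contributes a phasor whose phase is proportional to its path length, depth is read off the phase of their sum, and adding a couple of longer indirect paths biases the estimate upward.

    import numpy as np

    C = 3e8          # speed of light, m/s
    F_MOD = 30e6     # modulation frequency, Hz (illustrative value)

    def phasor(amplitude, path_length):
        # One light path: phase proportional to path length and to F_MOD.
        return amplitude * np.exp(2j * np.pi * F_MOD * path_length / C)

    def depth_from_phase(total):
        # depth = phase * c / (4*pi*f); the factor 2 is the out-and-back travel.
        return (np.angle(total) % (2 * np.pi)) * C / (4 * np.pi * F_MOD)

    true_depth = 2.0                                  # meters
    direct = phasor(1.0, 2 * true_depth)
    print(depth_from_phase(direct))                   # ~2.0 m

    # Two weaker interreflections with longer round-trip paths.
    indirect = phasor(0.4, 2 * 2.6) + phasor(0.3, 2 * 3.1)
    print(depth_from_phase(direct + indirect))        # ~2.3 m: depth is overestimated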
>> The time graph is a little bit misleading because this assumes like a perfect mirror reflector, and even then, it's fudging stuff, right? I mean, reflectors with a lot of gloss have a fairly narrow lobe.
>> Mohit Gupta: That's true. That's true if you have perfect mirrors in the
scene, then the sparse approaches would be better. But here we're assuming
that the scene is -- it has broad reflectance lobes. Yeah. So that's a good point. Yeah. So now, the challenge here is to somehow separate this infinite number of sinusoids from this direct component. And without any additional information, it kind of seems very challenging. Now, the key intuition that we have is that while we cannot prevent these indirect paths from happening, if we can somehow ensure that the sum of all these indirect sinusoids becomes a constant, a DC, and I'll show you how, but if somehow we can achieve that, then the total light, which is the sum of the direct and indirect light, will just be a DC-shifted version of the direct light. Which means we will have a different offset, but the phase will be
the same. It will be okay. So how do we achieve that? Now, before we look
at the solution, I want to first talk a little bit about how we represent these light rays. So far we've been talking about sinusoids, which are kind of a clumsy representation. Can we do something better? So here are two light paths, one direct and one indirect. And they have different sinusoids. They may have different phases. They may have different amplitudes. But the important observation is that the period, or the frequency, is the same, which is the same as that of the emitted light. Now this is kind of -- this is not very surprising. All that we are saying is that if we have a light source which is sending out light into the scene at a particular temporal frequency, the light may bounce around multiple times and get partially absorbed, but the frequency will not change. So if the frequency remains
constant, we can simply factor it out. The only two parameters that remain
are the amplitude and the phase. So we can represent each light ray by a
complex number or a phasor. So it's a more compact representation. So now
that we have this phasor representation, we can take a computational view of
light transport. We can think about the scene as a black box which takes as
input a phasor that is emitted by the light source and gives as output another phasor which goes into the camera. And this transformation between the input and the output is a simple linear one, where this constant here is called the light transport coefficient. This is what encodes the scene properties. This is a
very simple linear representation. So now we can use this representation to
start analyzing different light transport events. So different events that
light undergoes in a scene. So suppose we have this light source, there is
some initial amplitude of the emitted light. There is some initial phase.
Suppose light travels through free space. Now, we know that light does not
lose intensity. Only the phase changes. So the corresponding phasor transformation would be a simple rotation. Now suppose light gets reflected. At the moment of reflection, light only loses amplitude. The phase does not change. So there's a reduction in amplitude here. Now, suppose light goes through some kind of scattering medium where light is absorbed as well. So as light travels, both the phase and the amplitude change. So the transformation here would be [indiscernible]. And finally, a lot of light rays get added together. Each of them has a corresponding phasor here. The total light after superposition would be just the complex resultant of these phasors. So far, we have looked at a single frequency. Now let's look at
what happens as we change the modulation frequency. We're going to look at
the simplest case of free space propagation. So again, we have this initial phase and amplitude. After propagation, the phase changes. Now, the amount of phase change is proportional to the travel distance. But it is also proportional to the modulation frequency. Which means that for the same travel
distance, as we start increasing the modulation frequency, the amount of
phase change increases. Now, this is an important property. I want you to
remember that. I'm going to use it very soon. So now that we have this new
kind of visual representation, a new kind of visual algebra to analyze light
transport, you can use it to analyze interreflections. So this is a very
simple case. There's one single indirect light path. And there's a
corresponding phasor here. Next we look at another light path which is in a close neighborhood. Now, because light transport is locally smooth, the amplitude of this light path would be very similar to the previous one. But the path length is different, so the phasor is slightly different. So now, we look at a small continuum of light paths starting in this small cone here. All of them have approximately the same amplitude but slightly different phases. So these phasors will trace out this sector in the phasor space. Now, this is where we use the fact that the angular spread of these phasors is proportional to the modulation frequency, which means that as we start
increasing the frequency, the spread increases and these phasors start
cancelling each other out until eventually it becomes zero. So one more
time. We increase the frequency until eventually we reach a threshold where
the indirect component becomes zero. So that's the main idea. That's the
main contribution. We showed that if we use a high enough temporal frequency, then we can deal with the problem of interreflections even in the general case where there can be an infinite number of light bounces. So yeah, this is what we started out to achieve. The interreflection component becomes a DC. The direct component is still oscillating. And the total component is just a DC-shifted version, so that only the offset is different but the phase is the same. So we are nearly done here.
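The cancellation argument can be checked with a quick numerical experiment. It assumes a toy continuum of indirect paths with roughly equal amplitudes spread over a range of lengths, which is exactly the locally smooth assumption made above, not a measured scene.

    import numpy as np

    C = 3e8
    direct_path = 4.0                              # meters, round trip
    indirect_paths = np.linspace(5.0, 11.0, 2000)  # a continuum of longer paths
    indirect_amp = 0.001                           # similar small amplitude per path

    def magnitudes(f_mod):
        k = 2j * np.pi * f_mod / C
        direct = np.exp(k * direct_path)
        indirect = (indirect_amp * np.exp(k * indirect_paths)).sum()
        return abs(direct), abs(indirect)

    for f_mod in [1e6, 10e6, 100e6, 1e9]:
        d, ind = magnitudes(f_mod)
        print(f"{f_mod/1e6:6.0f} MHz   |direct| = {d:.2f}   |indirect sum| = {ind:.3f}")

As the modulation frequency increases, the indirect phasors spread around the circle and their sum collapses toward a constant, while the direct phasor keeps unit magnitude.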
We have almost solved the problem, but there's one small issue that is
remaining. And that is because if we use a high temporal frequency, we get a lot of depth ambiguities. And that is because if we have two different scene points, for example, this one here, this is the phase for this scene point, and if we look at another scene point which is further away, the phase wraps around after two pi and we get the same phase. Now, fortunately this is a very well studied problem in many different fields, including [indiscernible], in acoustics, even in [indiscernible], where the idea is that if you use two high frequencies that are very close to each other, we can emulate a low frequency. This is kind of similar to a beat frequency. So based on these ideas, we can use two high frequencies that are very similar to each other, estimate the phase of both of them, and use that to disambiguate the depths. So based on this, we have developed a
method called micro time of flight imaging where we use two high frequencies
that are very close to each other. We call this micro because both
frequencies are high. And the periods are small or micro. Now, if you
compare with conventional time of flight, it uses only three measurements.
It uses a single low frequency and three measurements. We need three because there are three unknowns: offset, amplitude and phase. Now, micro time
of flight uses one extra measurement but provides significant robustness
against interreflections. So next I want to show some results for some
simulations. This is a Cornell box.
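The disambiguation step can be sketched roughly as follows. This is a simplified illustration of the beat frequency idea with made-up frequency values, not the exact estimator from the work: each high frequency alone wraps after a short range, but the difference of the two wrapped phases behaves like the phase of the 1 MHz beat, which pins down the coarse depth, and the fine phase then refines it.

    import numpy as np

    C = 3e8
    F1, F2 = 150e6, 151e6                  # two close high frequencies (illustrative)

    def wrapped_phase(f, depth):
        return (4 * np.pi * f * depth / C) % (2 * np.pi)

    def two_frequency_depth(phi1, phi2):
        # The phase difference wraps at the beat frequency F2 - F1 = 1 MHz,
        # so it is unambiguous out to c / (2 * 1 MHz) = 150 m.
        beat = (phi2 - phi1) % (2 * np.pi)
        coarse = beat * C / (4 * np.pi * (F2 - F1))
        # Refine using the fine but wrapped phase of F1 (1 m unambiguous range).
        period = C / (2 * F1)
        fine = phi1 * C / (4 * np.pi * F1)
        return np.round((coarse - fine) / period) * period + fine

    true_depth = 7.38                      # meters, far beyond the 1 m range of F1 alone
    print(two_frequency_depth(wrapped_phase(F1, true_depth),
                              wrapped_phase(F2, true_depth)))   # ~7.38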
>> How do you decide how different these frequencies are?
>> Mohit Gupta: That's an engineering decision. Ideally, theoretically you
want them to be as close as possible. But there are practical limits imposed
by the light source. Light source has a finite frequency resolution. So
we -- if we know something about the scene, like maybe we know an
approximate range of scene depths. So based on that, we can do this kind of
tradeoff. So this is a Cornell box. Again, it will be okay if there is only
direct reflection but there are all these indirect bounces as well. This is
a shape -- this is the ground truth shape. This is the shape using conventional time of flight. And with one extra image, this is the shape recovered using micro time of flight. It's about a two-order-of-magnitude improvement. So
now, based on these ideas, we developed an experimental setup. This is a
light source. It's a bank of laser diodes. And the camera that we use, it's
a PMD CamBoard nano, from [indiscernible] in Germany, which makes these
customizable time of flight cameras. And we needed this to be customizable
because we wanted to change the frequencies. This is the experimental scene
here. It's a canonical scene for studying interreflections. It's a corner scene. There are two planes, and light bounces between these two walls. And we kept one wall moveable so that we can change the apex angle to change the amount of interreflections. This is the [indiscernible] setup here: the fixed wall, the moveable wall and the camera here. These are images of the setup for different apex angles, 45, 60, and 90 degrees. This is the shape recovered using micro time of flight
for the 45-degree wedge. Now, probably a comparison is more instructive. This is the ground truth shape. This is the shape using conventional time of flight; the mean error is about 85 millimeters. The depths are all overestimated. And this is the shape using micro time of flight.
Again, the errors are about one to two orders of magnitude lower.
>> [Indiscernible] continuous time of flight, right? So pulsed time of flight wouldn't have this issue?
>> Mohit Gupta: That's right.
>> And so what is the benefit of using continuous time of flight over pulsed time of flight?
>> Mohit Gupta: It's mostly a cost argument. Now, pulsed time of flight, which you mentioned, those systems are extremely expensive.
>> [Indiscernible].
>> Mohit Gupta: Mostly LIDAR approaches, like maybe the one that is used in the Google [indiscernible]. Now, that system itself is three times more expensive than the current [indiscernible]. These cameras can be as cheap as a hundred dollars.
>> Your straw man here is a single frequency, right? But the, you know, the biggest consumer time of flight, continuous wave time of flight, which is the Kinect one, uses multiple frequencies, right? So the issue, I worked with that team, the issue is often one about range ambiguity. So if you use the high frequencies -- well, anyway, but in other words, this idea of using more than one frequency isn't particularly novel, right? It's maybe the fact that you're using two very high, very close frequencies. Right?
>> Mohit Gupta: The main novelty here is the use of high frequencies. Yeah.
So people have used multiple low frequencies as well. But each of those
frequencies will be susceptible to interreflections. So the multiple frequency part is not novel. But again, the idea of using high frequencies, that's the novel part here.
>> So if you keep increasing the frequency, do you still get the DC?
>> Mohit Gupta: You do. And the reason for that is that this patch size, or the scene patch that I used for this derivation, is a mathematical construct. It's not a physical construct. So I guess your question is, if your frequency goes beyond that threshold frequency, would you still get the cancellation? You do, because you can still divide up your scene into smaller patches, where in each patch you get the cancellation effect. So a couple more comparisons.
This is 60 degrees and 90-degree wedges. This is ground truth. This is
conventional time of flight and this is micro time of flight. So just to
summarize this part here, we analyzed the effect of interreflections on time of
flight imaging and we showed that by using high frequencies that are close to
each other, we can deal with the problem of interreflections. Now, it turns
out that the same kind of tools and techniques can be used for the other 3D
imaging technique which we talked about, structured light. So just to remind
you, in structured light we project spatially coded intensity patterns on the scene. And one of the most popular structured light methods is phase shifting, where you project sinusoidally coded intensity patterns on the scene. Now, in conventional phase shifting, these patterns have very low spatial frequencies, and that's what makes them susceptible to the problem of interreflections. So if we do the same analysis that we did for time of flight, but now in the spatial domain instead of the temporal domain, we can show that by using high spatial frequency patterns, we can deal with the problem of interreflections. So based on these ideas, we've developed a method called micro phase shifting, where we use only high spatial frequency patterns. Now, this is only a very broad overview. I'm not going into
details here. I just want to show some results for this method. So this is
a scene. It's a concave ceramic bowl. Now, this actually is not very
diffuse. This has very narrow specular lobes as well. Now if you look
at a scene point here, it receives light directly from the projector. But it
also receives light due to this interreflection, indirect light. Now, this
is a shape comparison using conventional phase shifting. You see this
incorrect phase here. And with the same number of images, in this case it was seven images, this is the shape recovered using micro phase shifting.
Now, this is another example. This is one of my favorite examples. Here we
have a shower curtain. And there's an opaque background behind it. Now the
goal here is to recover the shape of the curtain itself. Which is nearly
transparent. Now, this kind of scenario arises a lot in medical imaging, where there are tissues which are nearly translucent and there's an opaque surface behind them. Now, you'll be okay here if there's only this direct reflection from the curtain. But now there is all this light which permeates the curtain, goes to the background, and then comes back out here. So because of that,
this is the shape using conventional phase shifting. There are a lot of
errors, big holes. Now, with the same number of images, this is the shape
using micro phase shifting. Now, there are no holes here, and you can also recover these fine details, like the ripples in the curtain.
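For reference, the basic phase-shifting decode that both conventional and micro phase shifting build on can be sketched as below. This is standard three-step phase shifting at a single spatial frequency; the actual frequency design of micro phase shifting is more involved and is not reproduced here.

    import numpy as np

    def decode_phase(i0, i1, i2):
        # Three images of a sinusoidal pattern shifted by 0, 120 and 240 degrees:
        #   I_k = offset + amplitude * cos(phi - 2*pi*k/3)
        # The projector phase phi is recovered per pixel with an arctangent.
        return np.arctan2(np.sqrt(3.0) * (i1 - i2), 2.0 * i0 - i1 - i2) % (2 * np.pi)

    # Toy check: simulate the three captured images for a known phase map.
    phi_true = np.linspace(0.0, 2 * np.pi, 5, endpoint=False)
    images = [0.5 + 0.3 * np.cos(phi_true - 2 * np.pi * k / 3) for k in range(3)]
    print(decode_phase(*images))   # recovers phi_true (up to numerical precision)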
>> [Indiscernible].
>> Mohit Gupta: You can do that as well. It's a good question somebody
asked me recently. I haven't thought much about it. But yeah.
>> [Indiscernible] really know the shape, can you actually make use of that
information?
>> Mohit Gupta: You could do that. Yeah, you could have an iterative
approach where you first reconstruct the foreground. Once you know that, you
can factor it into a reconstruction algorithm and you can -- yeah, so that's
a good thing to do in the future, yeah. Next I want to very briefly talk
about the problem of scattering because, again, the underlying physical
process is very similar. As I mentioned before, it's an important problem,
especially for outdoor 3D imaging. So suppose we have a convex scene, a
single convex scene. So the counter [indiscernible] is only along a single
direct path, a single sinusoid. But now there is some kind of medium between
the scene and the camera. So now there are all these indirect light paths
which we call backscatter which never even get to the scene. Now, as an
interreflection, these light paths and different lengths than the direct
path, so they are different phases here of the sinusoids. And because of
that [indiscernible], and we get incorrect phase. Now, the main observation
here is that this looking very similar to the case where there's
interreflections, there is one direct component and there is whole bunch of
indirect components. So maybe we can use the same tools that we do for
interreflections as well. So suppose this is a single simulation example.
Here, the scene was a hemisphere. This is a ground truth shape. This is the
shape you think conventional time of flight. Now the important difference
here is that in scattering, the indirect paths are actually shorter than the
direct paths. So the depths are underestimated. And now this is the shape
using micro time of flight. Now, I want to emphasize that this is only very preliminary, only a simulation result, and scattering is a very, very
challenging problem. But perhaps this is something which tells us where to
proceed. It's maybe a good starting step. Again, I want to very briefly
talk about dealing with difficult materials. Now, suppose the scene is made
of something well-behaved, something diffuse like wood. Now, these kinds of materials scatter incident light almost equally in all directions, which means that irrespective of where the camera is, it receives almost an equal amount of light. But now suppose the scene is made of something which is more challenging, like metal. Now, these materials have a very narrow specular spike, as Rick
mentioned. So now, depending on where the camera is, it may either receive
no light at all or a lot of light. And both are a problem. Now, ideally we
would like the surface to reflect light almost equally in all directions, but
we cannot change the material properties. But one thing that we can do is if
we can somehow illuminate the surface, not from a single direction but from
multiple directions, then we can emulate this different -- so one way to do
that is to place a diffuser in front of the light source so that it acts as
an area light source, which illuminates the scene from multiple directions
and now we get these reflections in multiple directions. So the light received by the camera is now not coming from a single direction but from multiple directions. Based on these ideas, we have developed
a method called diffuse structured light, where we place a diffuser between the projector and the object. The important thing to notice here is that the diffuser cannot be just an arbitrary diffuser, because it's going to destroy the projected pattern. So there's a trade-off here. We need to somehow emulate an area light source without blurring out or without destroying the pattern. So the key thing here was to use a diffuser which is linear, meaning it scatters light only along one direction. And if you use a structured light pattern which is also linear, and you align this diffuser along that pattern, we
get both advantages. We don't lose the structured light pattern, but you
also get an area light source. This is a coin, metal. This is the shape
[indiscernible] using conventional structured light. This is a profile view.
You get a lot of errors here. And with the same number of images, this is the shape using diffuse structured light. These are some more reconstructions
using diffuse structured light. And by the way, all [indiscernible] are
[indiscernible]. This is a regular [indiscernible] camera, a pocket
projector, and a cheap hundred-dollar diffuser. This is a lemon. It's organic, so it's translucent. Meaning, if there's a scene point, it gets direct light, but there is also light which permeates beneath the surface and comes back out. This is very similar to interreflections, so maybe we can use micro phase shifting to recover the shape. So this is the shape using conventional
phase shifting. There are errors due to scattering. And this is the shape
using micro phase shifting. Now, two of these methods, micro phase shifting
and diffuse structured light, were recently licensed by a big company in the
space of industrial automation. And they're soon going to release products,
hopefully in the next 12 months or so, for robotic assembly of machine parts,
mainly automotive parts, and also inspection of electronics, printed circuit
boards. Now, this is a huge multibillion dollar industry and it's very
satisfying to see our research make an impact here. So do we have time?
Five more minutes?
>> Sure.
>> Mohit Gupta: So in the next five minutes or so, I want to just quickly
talk about the lessons that we've learned from all this work and what are
some future research directions. Now, a central theme of all the projects I've talked about is to develop computational models of light transport, or
how light interacts with the physical world. We have developed models for
many different processes like interreflections, scattering, specular
reflection, et cetera. For example, we talked about using the phasor
representation to develop light transport model for time of flight imaging.
Now, here we use a phasor to model the temporal variation of intensity. But
we know that light is an electromagnetic wave, meaning that even if the intensity is held constant, there's an underlying electric field which oscillates over time. And the oscillation of the electric field is also sinusoidal, like the
intensity that we talked about. So one interesting direction is to build on
the tools that we use for time of flight to model the temporal variation of
electric fields. And I call this coherent light transport because now we are
really trying to model the coherent or the wave nature of light. Now, once
we do that, it really opens up the entire electromagnetic spectrum starting
from terahertz down to x-rays. Right now, vision algorithms are mostly
limited to visible or maybe [indiscernible] UV. Now, each of these
modalities, each of these bands interact differently with the physical world.
For example, we know that visible light cannot penetrate surfaces but
terahertz and x-rays can. So each of these bands can tell us something
different about this scene. For example, suppose there is an object here.
If you use visible light, we can learn about the surface appearance. But if
you want to learn about the underlying material properties, what this object
is made of, we want to use terahertz or x-rays which actually permeate
beneath the surface and then come back out. Now, the phase of light, the phase of the coherent light at the sensor, is a function of the path length here, which in turn is a function of the material properties. So
fortunately, now there are cameras which can measure not just intensity but also the phase of the electric field. So it would be nice to develop algorithms which can interpret these intensity and phase images to start making inferences about the material properties, as well as maybe some other higher order properties.
With these kind of algorithms and the right cameras, we will be able to -- we
may be able to take vision beyond the visible spectrum into these different
kind of modalities.
>> In this particular example, that internal scattering, there must be
millions of such paths.
>> Mohit Gupta: That's right.
>> So wouldn't the phase pretty much get totally destroyed by the
superposition of all these random paths, the coherence just disappears?
>> Mohit Gupta: It depends on the wavelengths that you use. If you use
something like x-ray maybe, which has a very short wavelength, then the
variation in the path lengths is going to be larger than the wavelength
itself. So then you will lose all the coherence. But if you use something like terahertz, where the wavelength is [indiscernible] millimeters or even centimeters, then you have some hope of keeping the coherence. So there
are many applications for something like this. For example, you can start
thinking about tools for personalized health monitoring because you can start
recovering properties about skin. So this can potentially be used for early
detection of skin cancer maybe. This can be very useful for robotic grasp
planning. Now, for these applications, we want to know the material
properties of the object that we're going to hold. Is it soft versus hard,
rough versus smooth, et cetera. Another interesting application is robotic
agriculture. We want to figure out whether the crop is ripe for picking.
Now, this can be very useful for preventing a lot of food wastage. Now,
another application: [indiscernible] has a property that it amplifies very small motions. So if we can model that, we can start building tools for
mechanical vibration analysis as well. It can be very useful in industry.
Now, another direction, and this is very speculative, is that we can start
building models for different exotic physical effects. Here's one example, birefringence. Many different materials like glass, under [indiscernible] mechanical [indiscernible], develop two different refractive indices. And this is what results in this [indiscernible] effect which we see a lot in our windscreens. We all have used polarized glasses at some point or another to
reduce glare while driving. There's another effect, fluorescence. Now, a
large fraction of materials known to us are fluorescent, which means that
even though the visible light image that we capture has some information, we can recover much more information, much more hidden information, if we look at the fluorescent image. And finally, there are some processes which are not even natural, which are only present in [indiscernible] materials, like this negative reflection. Now, it would be nice to build models for these processes
so that we can design vision systems that can capture information that was
not possible with existing systems. Now, building these models will not be
sufficient. We also need the right cameras to capture these effects. As Heisenberg, the physicist, not the chemistry teacher, if you watch [indiscernible], said: what we observe is not nature itself, but nature
exposed to our method of questioning. Now, imaging and vision systems
interact with nature by asking these visual questions, by capturing images.
So we need to design systems which ask the right kind of questions. Now, in
vision so far, we have used this pinhole model of cameras, where light rays are mapped from the scene to a flat detector through this pinhole. Now, this is fine for capturing images for human consumption, but it can be restrictive for computer vision systems. Going forward, we really need to expand the notion of cameras to a general light recording device which may map the light from the scene to the detector in all sorts of ways, like with different new kinds of optics. The detector itself may be curved. It may even be flexible. The camera may measure light across a wide range of bands. But perhaps most importantly, these cameras will not just passively observe the scene. They will have a programmable light source that will actively influence the image formation process. These light sources will act as probes that tease out information from the scene. Now, this field of active
illumination or active vision systems, it's a very rich field in itself.
There is a lot of research that has gone into it, not just in vision but in many different fields. Right now, Shree and I are in the process of putting together a book which gives a comprehensive introduction to all these active methods. We are learning a lot while writing this book and we hope
that it will excite the readers about this field of active vision and vision
[indiscernible]. So thanks a lot for listening and I'll be happy to take any
questions.
[Applause]
>> Mohit Gupta: Yes.
>> [Indiscernible] light experiments earlier, you mentioned that you're
changing the speed of your mirror, right? So you mentioned that you might
use slow, medium or fast speeds, but how are you actually calculating what
that speed should be?
>> Mohit Gupta: So we have a system where the camera first captures two
extra images, one with the light source on and then one with the light source
off. And that gives you an indication of the ratio of the projected light
versus ambient light. And based on that, it's a feedback loop kind of a
thing. We change the rotation speed.
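Roughly, that feedback loop might look like the following sketch. This is hypothetical pseudocode: the closed-form spread expression is not given in the talk, so the mapping from the measured ratio to a rotation speed is hidden behind a placeholder calibration object.

    def choose_rotation_speed(frame_on, frame_off, calibration):
        """Pick the mirror rotation speed (light spread) from two probe frames."""
        source_level = (frame_on - frame_off).mean()   # projected-light estimate
        ambient_level = frame_off.mean()               # ambient (sun) light estimate
        ratio = max(source_level, 1e-6) / max(ambient_level, 1e-6)
        # Stronger ambient light -> smaller optimal spread -> slower mirror rotation.
        # 'calibration' stands in for the derived expression, which also depends
        # on source power and sensor characteristics.
        spread = calibration.optimal_spread(ratio)
        return calibration.speed_for_spread(spread)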
>> Okay, I see.
>> [Indiscernible] examples which you showed for the first part of the talk,
the results of using your system in ambient [indiscernible], the examples you
showed for Columbia, the different slides. How big were they? Were they
cropped or were the images, is that what your camera is actually capturing?
>> Mohit Gupta: Right now, that's what the camera is capturing. The scene
distances that we used were about one to two meters. And the light source
that we have is actually very small. It's like a pocket light source. Now,
it may be -- it will be possible to scan the entire scenes if you use a
slightly larger source.
>> [Indiscernible] you were talking about this tradeoff between pointing the
light source at a small part of the scene versus the whole field of view, so
in this case, field of view is an important variable in this thing, and so do
you think that you can actually figure out a way to also build
omnidirectional -- not omnidirectional but large field of view cameras?
Because a lot of the technology that's there has this issue of
[indiscernible] relatively small field of view.
>> Mohit Gupta: That's true. I think that the field of view of the camera
is actually limited by the field of view of the light source. So right now,
we have to use a relatively small light source field of view because you're
limited by light source power. If you can come up with methods -- and this
may be an example where we can increase the field of view of the light
source, then we can increase the field of view of the camera as well. And
perhaps even go to an omnidirectional sensor.
>> A problem that in practice we [indiscernible] distance, [indiscernible] distance, the depth of field. So [indiscernible].
>> Mohit Gupta: So are you talking about depth of field issues?
>> Yeah. So for example, same camera, I would like to capture
[indiscernible], so be able to capture near distance and also far distance.
>> Mohit Gupta: So if you have a large depth variation in the scene, there
can be two problems. One is the limited depth of field of the sensor, right? So what you're saying is that maybe if you focus your camera behind the scene, or towards that part of the scene, then the front part of the scene will be out of focus. Now, that problem you can deal with by just using a smaller aperture. Or you can use many of these coded aperture imaging methods that are developed in computational imaging. So that's a blurring problem. Right? Now, another problem that you may have is the dynamic range problem, because you're using this active light source, and if the scene is very close to the camera, it will receive a lot of light. But if the scene is far from the camera, it will not receive a lot of light. So the question will be what part of the scene do you optimize for.
>> [Indiscernible]. You always want all of them. That's practical requirements.
>> Mohit Gupta: Right. That's an engineering decision. You can either
optimize for one small depth or if you know [indiscernible] that you're going
to have a large depth range, then you can optimize your parameters
accordingly. Like the light spread that I talked about: right now, we assume that the range of scene depths is small. But if you knew [indiscernible], then you could do something smarter.
>> [Indiscernible] sounds like the same kind of problems that offshore seismics have to solve. In offshore seismics, you have [indiscernible] that generate mechanical energy [indiscernible] through the ocean to find the rock bottom from the [indiscernible] coming back.
>> Mohit Gupta: That's right.
>> So [indiscernible], are you absorbing this knowledge from seismics into active illumination?
>> Mohit Gupta: So seismic is one example, but maybe perhaps a closer
example is radar.
>> [Indiscernible].
>> Mohit Gupta: I mean, in seismics, you are using ultrasound, which are sound waves. And they have a slightly different model as compared to transverse light waves, right? So as I said, the closer model is radar, and a lot of these problems are being looked at in the radar community as well, the problem of
interreflections, et cetera. Now, there, the solutions are slightly
different because of the difference in wavelengths involved. Now, many of
these problems are not that severe if you are using radar because the
wavelengths are large and the kind of scene structure that you are observing,
the scale is actually smaller than the wavelength of light that you are
using.
Sing Bing Kang: Well, let's thank the speaker once more.
[Applause]