>> Sing Bing Kang: Good morning, everyone. It's my pleasure to introduce Oliver Cossairt. He
graduated with a PhD from Columbia University and now he is with Northwestern, where he
founded the Computational Photography Lab. Oliver is an expert in computational imaging and
he is going to be talking about some of his projects today.
>> Oliver Cossairt: Great. Thank you, Sing Bing. First of all, it's great to be here. Thanks, Sing
Bing for inviting me. Basically, what I have in mind for this talk today is to just give you a bit of a
flavor for the different types of projects that we work on in my group. I've been at
Northwestern now for about three years. I'm just starting now to build up my group. We are
interested in basically anything where we get to play around with actual optical
instrumentation, so like building cameras, building imaging systems, and also get our hands
dirty in terms of developing both the algorithms and the instrument itself. That means actually
playing around with the optical side of things and the sensing hardware or any combination of
these things. Broadly, we are interested in applications that span anything in the field of optical
imaging, anything in the field of computer vision related applications, or overlaps between that
and image processing and computer graphics. That's basically the high-level idea of the interests
and motivation of my research group. What I'm going to talk about today is essentially three
different projects that sort of give a flavor of the kind of things that we do and the kinds of
things that we are interested in. And I'll also give you a sense of the diversity of the type of
work that we do and want to do, and basically various levels of technical novelty. The first one
that I'll talk about is this motion contrast 3-D scanner that we have developed, which is
essentially what we think is a very light-efficient, power-efficient way to do active 3-D
scanning. I'll tell you about that with some technical details, and then I'm going to talk for a
while about these surface shape studies of some very famous artworks that are
housed at the Art Institute of Chicago. We do some collaborations in cultural heritage
imaging with places like the Art Institute and a couple of other museums that we are working with as
well, things like the Georgia O'Keeffe Museum, and basically what we are interested
in here is also doing 3-D imaging. The technical novelty here is not as great, but the value to the
conservation community is significant because they have never used techniques like this in any
of the analysis that they do. The last thing I'll talk about is about this phase retrieval problem
that we just started to get interested in and this is part of a broader effort in my group to look
at a specific image processing problem that has had a long history, mainly in the optics world,
and has now gained some interest on the more theoretical, mathematical image
processing side. It has some really interesting applications in biological imaging that have
just emerged in the last couple of decades. I'm just going to give you a high-level overview of
all of these projects with some technical details on the first and the last one. I'll talk about MC3D,
or motion contrast 3-D scanning. All the technical work for this project was done by my
PhD student Nathan Matsuda in collaboration with Mohit Gupta who I think some of you guys
saw give a talk here also about active 3-D scanning not too long ago. This is really the high-level
idea that we were going after. The idea is to take a point scanning system like this. We have a
laser dot that is moving across our scene and our goal is to capture information about the
position of that dot as it scans across the scene. The traditional way to do this would be to
capture a sequence of photographs, and that would be essentially equivalent to the
information you would need to capture for a point-scanning 3-D laser scanner. The
issue there is really the amount of data you need to capture. If you have an image sensor that
is M by N pixels and you're scanning your point across each pixel in the scene that's going to
require M times N images and the goal that we're going after here is that really the amount of
information in the scene as we're projecting this dot and moving it across the scene is actually
really minimal. What we want to do is figure out some way to extract from our sensor just the
pertinent information about the location of the spot in the scene and then essentially stack
that up as a queue, a list of numbers as a function of time as the point is scanned across the
scene. The goal there is that if you have some way to do that, then you can reduce it from M
squared times N squared measurements to just M times N different pieces of data that your
sensor has to capture and send through the system pipeline to the computer for the 3-D
processing. That's the high-level idea. Again, the goal is to do 3-D capture, so the reason we're
looking at 3-D scanning is that basically this is the way I think about it. Broadly, we can
categorize 3-D capture techniques into the most basic ones, the passive ones, which
are the stereo methods. Two cameras looking at different views of the scene with different
amounts of parallax. When you look at the shift of different types of texture features you can
compute the depth based on the baseline between the two different cameras. Very closely
related are depth from focus or depth from defocus techniques, where you use a depth-dependent
blur to try to figure out what the distance to the scene is. These passive techniques,
the idea here is they can work well for scenes that have strong amounts of texture, but they
really don't work if you have no high-frequency spatial information in the scene. So if you are
trying to capture the depth of a wall, you are capturing two images of the wall from different
perspectives and the images look the same. If there's no texture then you can't calculate any
parallax information. That's why we looked at active scanning systems. Active scanning
systems essentially project texture in one form or another onto the scene that allows us to
establish correspondences between, in this case, a projector projecting light and a camera
capturing light. Laser scanners are one form of active 3-D acquisition systems, but they're part
of essentially a broad class of structured light systems. Again, all of these techniques are based
on triangulation, having the camera displaced with some baseline to a projector and a projector
is projecting light into the scene. For the laser scanner, the light pattern it's going to project is just a
dot that gets captured in sequence. For a system like the structured light system in the Kinect
1, a single pattern is projected, the scene is captured by the camera, and then the deformations
of this pattern are used to determine parallax information, essentially trading off some sort of
spatial resolution or assumptions for getting a 3-D image in a single snapshot. The triangulation-based
structured light methods work on a different principle than the time-of-flight principle. A time-of-flight
system would be like the latest Kinect 2 type of sensor. The way that works is you
temporally modulate the illumination from the source and then you have some sort of
demodulation that happens on the sensor, and you use that to figure out the time of travel from
the illumination source to the scene and back to the sensor, which you can then use to
calculate the depth of the scene. For the rest of the talk I'm not really going to talk about
time of flight systems because I'm going to focus on triangulation-based systems. A high-level
comparison you can keep in mind about the systems is that for the temporal modulation
systems you need to have high modulation frequencies to get high depth resolutions and there
are limitations both in terms of, literally on the sensing side in terms of frequencies that you
can use on your sensor and the depth resolution that you can get. The triangulation-based
systems, the depth resolution depends on the baseline and distance to the scene. In general,
for closer scenes you can do much higher depth resolution with a triangulation-based structured
light system, whereas with a time-of-flight system you can get very high depth
resolution even at really large distances away, so that's sort of the way to compare these
two different technologies. Here's an example of being able to use triangulation-based
structured light systems. In this case there's a line-stripe laser scanner; I think
they got sub-millimeter accuracy in this scan. This is the Digital Michelangelo Project,
where they scanned Michelangelo's David. You can get very, very high quality 3-D scans with
laser scanning. That is essentially where we started out: we want to push
towards this type of very high quality 3-D scans as much as possible. A taxonomy of structured light
systems. Here I have sort of my classification of this sort of continuum between different types
of structured light systems. On the top left here we have a laser scanning system, so we have a
point being projected from a laser a point or a line being projected from a laser that's being
swept through the scene. We capture a sequence of photographs from our camera and then
we figure out what the disparity is, essentially the difference between the column that is
projected out from the projector and the column that is received on the camera. That gives us
depth information that is scanned over time. If we are doing point scanning and
our resolution is M by N, we need to capture M times N images. If we are doing column scanning
then we need to capture M images which could be significant, as much as like 1000. The way to
make this faster, and remember the key here is just establishing correspondence between the
projector and the camera. We are essentially trying to find this triangle, sending a ray out from
your projector and then finding where that ray projects to in the camera. That's the whole
goal. That correspondence is the key to calculating the depth, because once you know the geometric
relationship between the camera and the projector, you can use that to calculate the depth.
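As a rough sketch of the triangulation geometry he is describing (assuming a rectified camera-projector pair; the baseline b and focal length f are not given in the talk, so the symbols below are placeholders):

```latex
% Depth from a camera-projector correspondence, rectified geometry assumed:
%   d = x_p - x_c   (disparity between the projector column and the camera column)
z = \frac{f\,b}{d}
% Differentiating shows why depth resolution degrades with distance and improves
% with baseline, which is the comparison against time-of-flight made earlier:
\delta z \approx \frac{z^{2}}{f\,b}\,\delta d
```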
There are other ways to capture, to get this correspondence information faster. One of the
ways is to use a binary coding scheme. The idea is you project a series of binary patterns with
different spatial frequencies and what they do is uniquely code each of the columns in the
projector's frame of reference so when you measure a bit pattern on your camera on off on off,
whatever it is, you can uniquely figure out which code in the projector that column belongs to
and that establishes the correspondence. And then you can calculate the depth. The
advantage here is you can use a divide and conquer approach, essentially getting logarithmic
time for what would otherwise take linear time in terms of the number of columns.
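As a minimal sketch of the binary coding idea just described, assuming standard reflected binary (Gray) codes; the column count below is illustrative, not a number from the talk:

```python
# Binary-coded structured light: ceil(log2(N)) patterns uniquely label N projector columns.
import math

def gray_code(n: int) -> int:
    """Reflected binary (Gray) code of integer n."""
    return n ^ (n >> 1)

def gray_decode(g: int) -> int:
    """Invert the Gray code to recover the original column index."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

num_columns = 1024                                    # projector resolution (illustrative)
num_patterns = math.ceil(math.log2(num_columns))      # 10 patterns instead of a 1024-step sweep

# A camera pixel observes one bit per pattern (on/off after thresholding).
column = 387
bits = [(gray_code(column) >> k) & 1 for k in reversed(range(num_patterns))]

# Decoding the observed bit string recovers the projector column, i.e. the correspondence.
observed = int("".join(map(str, bits)), 2)
assert gray_decode(observed) == column
print(num_patterns, "patterns suffice for", num_columns, "columns")
```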
You can add intensity modulation to the mix, so now, in addition, you basically assign a
unique intensity identifier to each projected pattern. One of these examples is the phase
shifting type of pattern where you project sinusoids. Another example is a phase ramp where
you just project a gradient in intensity. For instance, with the phase shift pattern you can
use what is in principle just three images. With the phase ramp you can just do two images
and basically what we are seeing is as we are moving in this direction the ability to capture the
images faster. Here I have the example of this first generation of Kinect where we are just
projecting the single image and we are using smoothness constraints to recover the depth. This
is sort of the space of different types of structured light systems. In the left-hand corner, with the
laser scanning system, we have what is essentially a very slow system. However, for that system
the light is highly concentrated, because you are taking all of your available light and
focusing it on one part of the scene. And then for all of these other systems you are eventually flooding
the scene with lots of illumination, and the advantage of that is it gives you a faster acquisition
system. This is a trade-off.
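A back-of-the-envelope way to see that trade-off, under the assumption of a fixed projector power budget P (none of these quantities are specified in the talk):

```latex
% Fixed projector power P spread over the illuminated area:
%   concentrated line:  E_{line}  \approx P / A_{line}
%   full-frame flood:   E_{flood} \approx P / A_{scene} \approx P / (N \, A_{line})
% Against a fixed ambient irradiance E_{amb}, the signal-to-ambient ratio of the
% flooded patterns is therefore roughly N times worse than that of the swept line:
\frac{E_{line}}{E_{amb}} \;\approx\; N \,\frac{E_{flood}}{E_{amb}}
```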
>>: About 10 years ago somebody proposed a system using a diffraction grating to generate a
sweep of wavelengths, with each wavelength being one plane, and then you could simply just look
at the color of each pixel in the scene and, because it's monochromatic light at any one point,
determine which plane you were looking at. Does anybody use that technique today?
>> Oliver Cossairt: What you are describing sounds like OCT. Does it use coherence gating?
>>: No. It's just a diffraction grating, sends out a rainbow basically, but if you look at any one
plane in that rainbow the light is monochromatic. And so you can look at the ratio of the red
green and blue responses that you get in your pixel, the pixel that's being illuminated, and you
can immediately determine which monochromatic wavelength gave rise to that.
>> Oliver Cossairt: Yeah, that's basically related to phase shifting. You're using intensity coding,
right, so whatever the intensity is in the captured image, it's going to be a unique identifier for the
column which will allow you to establish correspondence. In that case it's just using color in
addition to intensity. That would fall in kind of this category over here, so definitely, people do
that. Commercially, I don't know of any system that is out there that does that, and I'm not
sure exactly why that specifically doesn't exist in a commercial product. But it could, certainly.
The point that I was trying to get at here was this trade-off between how fast you can capture
the data and how efficiently you're using your light in the system. And the point there is that
it has a really strong implication for some particularly challenging types of 3-D scanning
environments. For instance, here's a comparison between laser scanning, line scanning in this
case, and then gray codes. This was sort of the ideal scenario where you project light at the
scene and the only light that exists in the universe is coming from the projector. However, in practice
there's always going to be some light just ambiently in the scene coming from overhead light
that is bouncing off the walls. If you're trying to go outdoors, that means you have to compete
with light from the sun. Your model is that your only illumination source is the projector. In
reality, that's not the case. However, for our laser scanning system, we are taking all of our
available power and we are concentrating it to a small line.
>>: [indiscernible]
>> Oliver Cossairt: Yes. Actually, that would be great. Is that better? Better contrast. Okay.
The goal here is to establish the correspondence. We know which column in the
projector projected this line; you want to figure out where it maps to on the sensor, so you use
some sort of binary threshold or some equivalent kind of algorithm and figure out the location
of this line and now you have established a correspondence. It works fine in the absence of
ambient illumination. If you add ambient illumination to the mix, providing you have sufficient
optical power and in this case remember you are taking all of your optical power and you are
concentrating it in a narrow line, you can just adjust your threshold and still get a good
binarization. But the issue here when you switch over to a gray coding type of a scheme, which
remember, the whole motivation was to make this scanning mechanism faster, is that you
flood the scene with illumination with this binary coding pattern. Now you're
conserving total optical power, so the power that you used to project onto this line you have now
spread it over a larger area. As a result, of course, when you have the ideal case you can get a
good binarization, but when you add ambient light into the mix, you now have your ambient
light being much stronger in certain areas in your scene and you get, even if you adjust your
thresholding, binarization errors still result that propagate to depth errors in the scene. Yeah?
>>: There is an obvious fix for this, right? Are you going to tell us what that is? I mean you just
take subsequent images and say when is it above or below the average, right?
>> Oliver Cossairt: Yeah. There's a lot of work in trying to make binary and related coding
patterns be robust in ambient light. Yes. There is definitely an impact and I think Mohit talked
about several of these techniques when he was here a couple of weeks ago. I'm actually not
going to get too much into detail about that, but you're right. That's an important point of
comparison. Today, we do not have exhaustive comparisons with those techniques. Hopefully,
you can sort of suspend your questions about that and maybe we can talk about it offline.
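As a minimal sketch of the fix the questioner alludes to, comparing each binary pattern against its complement instead of using a fixed threshold; this is a common robust-decoding trick and not necessarily the comparison used in any of the systems discussed here:

```python
import numpy as np

def decode_bit(img_pattern: np.ndarray, img_inverse: np.ndarray) -> np.ndarray:
    """Per-pixel bit decision for one binary structured-light pattern.

    Thresholding a single image breaks when ambient light adds a large,
    spatially varying offset. Projecting the pattern and its inverse and
    comparing the two captures cancels that offset, since it appears in both.
    """
    return (img_pattern > img_inverse).astype(np.uint8)

# Toy example: a 10-count signal riding on a 200-count ambient offset.
ambient = 200.0
pattern = ambient + np.array([[10.0, 0.0], [0.0, 10.0]])
inverse = ambient + np.array([[0.0, 10.0], [10.0, 0.0]])
print(decode_bit(pattern, inverse))   # [[1 0] [0 1]] despite the ambient offset
```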
The first one was ambient light. The second closely related one is global illumination. The idea
here is that we are looking at an object that, in this case we are looking at a book. It's like a
wedge pattern. Our model is that light goes from our source, bounces off the object and goes
to this sensor. This is sort of the picture of what that would look like if that were true. This is a
picture of what we actually see. Light goes and hits the object. It bounces around different
places in the object and then comes to the sensor. This is just a reality of the way that light
propagation works for conventional and coherent imaging systems. We run into the same sorts
of problems here which, again, can be mitigated and there are actually some very specific cases
that you can illustrate where the technique will work for our laser scanning system. And for the
gray coding system you have the same sort of problem where the light that bounces around the
scene eventually gets low pass filtered and just looks like a constant offset that is closely related
to the same problem that you get with ambient illumination. Again, this is a scene dependent
problem here. This problem exists for concave things, where light has the potential
to bounce around multiple times before it gets to the sensor. If it were a convex object, vice
versa, then it wouldn't be a problem. This is basically the space that we were trying to chart
out here in looking at 3-D scanning systems this way. On one axis we have acquisition speed.
We like to be able to acquire 3-D scans fast. We like to do them at high resolution. And we
would also like to have our systems be light efficient, because the more light efficient it is, the
more robustly we can do our 3-D scanning relative to essentially light pollution from the scene. A
laser scanning system we put up over here on the upper left-hand side. It's very light
efficient. It takes all the light power and concentrates it to a small area. It's slow, but it's
high resolution. And then we can plot out other systems. We get gray coding and phase
shifting. Those are higher-speed. We've essentially traded off speed here for light efficiency.
Instead of concentrating the light to a small area we are now spreading it across the scene and
then you can throw things like single shot methods like Kinect out there and we can also look in
the literature and see how there are systems that essentially offer trade-offs between these
different extremes. Mohit had a paper where he was looking at a hybridization between a gray
coding scheme and a laser scanning scheme. There was a scene-dependent trade-off parameter that
allows you to adjust between the two systems dynamically trading off light efficiency against
speed. There are also methods that are sort of hybridizations between single shot and gray
coding essentially looking at the motion in the scene and changing the patterns that are being
projected depending on if there is motion or not. This is our taxonomy of the space, and where
we are trying to get is the upper right-hand corner. This is the idea. The idea is to
take this new principle. Not a new principle, it's a very old principle, actually. Biologically
inspired principle, which is to sense, instead of sensing optical intensity, directly sense motion
contrast. This is a comparison. The very high-level idea. We have some source with time
varying intensity. We have photon fluxes that are changing as a function of time. For a
traditional photo sensor that would get converted to current or time varying voltage. And then
this is all happening internally. This is essentially the photosensor in your pixel, and then that
analog time-varying photocurrent would get sampled at different points in time
periodically using an analog-to-digital converter. The motion contrast sensor, the idea here is to
take that same time varying photon flux, converting it to a time varying voltage in this case and
then in the analog form electronically perform a temporal differentiation, and then have a
thresholding mechanism that is looking around waiting for this temporal derivative to exceed
some threshold, and when it does you spit out some information. This system here is by nature
an asynchronous sensing mechanism. This one polls for information even when you get no
intensity change. This one only spits out information when the value exceeds some threshold.
It's a biologically inspired idea. Constant stimuli to the photoreceptors in your retina
produce no output, so the way that your photoreceptors in your retina send impulses to your
visual cortex is by sensing changes in intensity over time. This is enforced at least partially
by microsaccades that are always causing the eye to scan the scene, so a lot of times what you
are seeing is edge information that gets mapped to temporal information and then that gets
differentiated by the internal circuitry in the photoreceptor, and it's a compression mechanism.
It's a way to take the hundred million photoreceptors in the retina and essentially compress them
down to the 1 million or so signals that get sent down the optic nerve. It's the same idea here. People
have built sensors based on… Yeah?
>>: Do you get a better signal-to-noise ratio if you have a suddenly varying input and then you had
a matched temporal filter as opposed to a differentiator?
>> Oliver Cossairt: That sounds plausible, to get a better signal-to-noise ratio. That's a good
question. You have one signal and you are sensing that signal only. That sounds like the ideal
way to sense that signal. Yeah. I'll have to think about that some more. That's a good
question. What you are getting at, and it's a good point, is that we are playing around with time
domain stuff. It ends up being very closely related to the temporal modulation of time-of-flight
type techniques, although there's going to be no time-of-flight information in this. This is
going to be a triangulation based technique. But we are incorporating temporal information
implicitly into the way that the sensor is designed. Basically, there's a team at ETH Zürich, a
sensing team there, that has played around with sensing architectures that are based on this
motion contrast principle. This is a high-level block diagram of the circuit. You have a
logarithmic photoreceptor, so you have time varying photon flux coming in that gets converted
to a time varying voltage that is proportional to a logarithm of the incoming flux. Then you
have a temporal differentiator circuit which takes this time-varying voltage and differentiates it,
and then you have a threshold mechanism that looks for when the derivative of the time-varying
voltage exceeds some threshold. That gets polled
asynchronously and gets sent out as events to the readout circuitry of the
sensor. This is one pixel, basically. There would be a whole bunch of these replicated in an
array and this is what the output of the sensor looks like. The idea here is that this is intended
to be an apples-to-apples comparison in terms of bandwidth, so the same number of bits per
second are coming out of these two imaging systems. On the left we have a conventional
video camera. On the right we have a motion contrast camera, and the point here is that on
the left we're using a lot of bandwidth to measure pixels that are not changing in intensity
very much. On the right any pixels that are not changing in intensity are not being sensed, and
what we are doing instead is we're using that extra bandwidth to sense the motion at higher
temporal resolution. That's the motivation for using this sensor and it's been used for things
like high-speed tracking. There's actually some surveillance type of stuff that they use it for and
then there are a lot of people in field biology that are interested in this because they can do,
can put a sensor out there to watch like when a rabbit comes out of a hole and it is not
consuming any power until there is movement in the scene. That's the main advantage of it.
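A toy simulation of the motion contrast pixel just described (log photoreceptor, temporal derivative, threshold, asynchronous events); the threshold and the flux trace are made up for illustration, and the event logic is simplified relative to the real DVS circuit:

```python
import numpy as np

def motion_contrast_events(flux: np.ndarray, t: np.ndarray, threshold: float = 0.15):
    """Emit (time, polarity) events when log intensity changes by more than `threshold`.

    This mimics a DVS-style pixel: constant flux produces no output; only
    changes in log intensity generate events, asynchronously.
    """
    log_i = np.log(flux)
    events = []
    reference = log_i[0]                      # level at the last event
    for ti, li in zip(t[1:], log_i[1:]):
        delta = li - reference
        if abs(delta) >= threshold:
            events.append((float(ti), int(np.sign(delta))))
            reference = li                    # reset the reference after firing
    return events

t = np.linspace(0.0, 1.0, 1000)
flux = np.ones_like(t)                        # constant background: no events
flux[400:420] *= 5.0                          # laser spot sweeps past around t ~ 0.4
print(motion_contrast_events(flux, t))        # one ON event near 0.40, one OFF event near 0.42
```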
This is the idea. The idea is really simple. Take our laser scanning system where we are
basically scanning sequentially through the columns in the projector's frame of reference, and
then take that system and image it with a motion contrast camera. That's it. The point here is that
what we're going to do now is we're going to scan this laser scanner fast, so we're going to do it
at 60 frames per second, 60 Hz repetition rate. At that speed we're going to basically assume
that the motion in the scene is static. As a result, the only intensity changes in the scene are
caused by the laser spot sweeping across it and briefly increasing the reflected brightness at each
point it hits. At the sensor that essentially produces a spike in the time-varying voltage produced
by a pixel. That gets differentiated and thresholded and produces a single event that tells us the
location of the laser scanner at that specific
point in time. That gets sent as a single event. It gets sequenced over time as the laser scan
moves position. Each of these events produces two pieces of information. One is the column
in the sensor that it gets mapped to, and the other is the time at which that event occurred.
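A minimal sketch of how such an event could be turned into depth, assuming a calibration that maps event timestamps within a sweep to projector columns; all the constants below are placeholders, not the actual MC3D calibration:

```python
# Placeholder calibration constants (not the actual MC3D values).
SWEEP_PERIOD_S = 1.0 / 60.0      # one full laser sweep at 60 Hz
PROJECTOR_COLS = 128             # projector columns swept per period
FOCAL_LENGTH_PX = 300.0          # camera focal length in pixels
BASELINE_M = 0.1                 # camera-projector baseline in meters

def event_to_depth(camera_col: int, timestamp_s: float, sweep_start_s: float) -> float:
    """Convert a single (column, timestamp) event into a depth estimate.

    The timestamp tells us which projector column was illuminated when the
    event fired; the camera column tells us where that spot landed. Their
    difference is the disparity, and triangulation gives the depth.
    """
    phase = (timestamp_s - sweep_start_s) / SWEEP_PERIOD_S   # 0..1 within the sweep
    projector_col = phase * PROJECTOR_COLS
    disparity = projector_col - camera_col
    return FOCAL_LENGTH_PX * BASELINE_M / disparity          # z = f * b / d

# Example: an event on camera column 10, fired 2.5 ms into the sweep.
print(event_to_depth(camera_col=10, timestamp_s=0.0025, sweep_start_s=0.0))  # ~3.3 m
```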
Based on calibration between the timing of the projector and the timing of the camera, you can
produce a conventional disparity map and then use the intrinsics of the pose and the baseline
of the camera projector to compute a 3-D model. That's the idea and the real advantage in
comparison to a conventional laser scanning system is basically this bandwidth utilization. The
idea here is if we have a line scanner here, the amount of time that it takes to scan each line in
the full scene would be equivalent to the amount of time it would take given a sensor with
equivalent bandwidth, the same number of pixels per second that are being read by the sensor,
you'd need to capture M different frames where each frame is only capturing one column of
essentially depth information. Most of the pixels in those frames correspond to essentially wasted
bandwidth, pixel measurements that don't really correspond to what you're trying to sense in
the scene at all. We can do the equivalent of laser scanning in real time. That's the idea. This is a
prototype. This is the DVS 128. This is that motion contrast focal plane array that I was talking
about before. It's a commercial product but it's sold in small quantities right now. There's
essentially a startup company that is looking for applications for this. We approach them with
this idea of building a 3-D scanner based on this and built this prototype. The projector, what it
was was a ShowWX projector from MicroVision. It has a MEMS 2-D scanner and a laser, and
basically what it does is the laser projects onto the scanner and then out to a point in
space, and then you intensity modulate the laser as you scan the MEMS scanner and it produces
an image using essentially just laser temporal intensity modulation synchronized to this 2-D
MEMS scanner. It's a commercial product. It was produced relatively inexpensively. It's like a
$300 pico projector. For one reason or another the product is not available anymore. I'm not
sure where the technology for MicroVision sits at this time. I've been told that there's a new
company that's going to be commercializing it. We use that system for our prototype and have
since switched to our own custom-built laser scanning system for various reasons. Here's a
comparison. Here we are comparing against Kinect 1 and the main thing here is that we are
getting the same scan speed as the Kinect 1. They're both triangulation-based systems. That's
one of the reasons why we're doing the comparison between those. The point here is what
sort of performance can you get out of, how do you maximize performance in, a triangulation-based system. That was the comparison we were trying to make here. The point is that we can
get laser scan quality, all things being equal in terms of resolution. This is again relatively low
resolution for this prototype, 128 x 128 resolution, but we can certainly do better than the
Kinect 1 given the same amount of time to produce a scan. Again, close to laser scan quality
even though we get orders of magnitude decrease in the amount of time it takes to produce a
scan. Here's a real-time scan result. It's just a person spinning a coin. In this case this is 128 x
128 pixels at 30 frames per second and the only processing done here is some 2-D median
filters to correct for dropped pixels. Yeah?
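A sketch of that kind of dropped-pixel cleanup using a standard 2-D median filter; the window size and the zero-means-dropped convention are assumptions, not details from the talk:

```python
import numpy as np
from scipy.ndimage import median_filter

def fill_dropped_pixels(depth: np.ndarray, window: int = 3) -> np.ndarray:
    """Replace dropped (zero) depth pixels with the local median of their neighborhood."""
    filtered = median_filter(depth, size=window)
    out = depth.copy()
    out[depth == 0] = filtered[depth == 0]   # only touch pixels that were dropped
    return out

depth = np.full((5, 5), 2.0)
depth[2, 2] = 0.0                            # a dropped pixel in an otherwise flat patch
print(fill_dropped_pixels(depth)[2, 2])      # 2.0
```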
>>: [indiscernible] by rotating the figure or something? How did you, because it's in motion,
right?
>> Oliver Cossairt: No. It's based on motion in the sense that there's change in intensity over
time. The way that we induce the change in intensity over time is projecting a laser spot out in
the scene and moving it really fast. That induces a change in reflected brightness as a
function of time that is sensed by the pixel, which is the information that we take advantage of.
We assume the object to be stationary over the duration of the scan.
>>: [indiscernible] like when you project something you might have multiple bounces and it
might exceed the threshold and you get multiple points.
>> Oliver Cossairt: You can show that as long as the portion of the light that is reflected back in
the direction of the camera is greater than the amount of light reflected in some other
direction, meaning that it's basically not a perfect mirror, not too specular, then you are
guaranteed to be able to set a threshold that will work with the system. I'll show you some
examples of highly reflective objects, or one example.
>>: On the one that you showed us why is the table cut out? Why was there only the hand and
the coin?
>> Oliver Cossairt: I don't know. I would have to ask Nathan about that. I'm not sure. This is an
example of a pinwheel in comparison to the Kinect 1, at identical frame rates. The real advantage that
we are pushing for with this is performance in ambient light and the hopeful application there
is outdoors, doing robust outdoor 3-D active triangulation-based scanning. Here's an initial
experiment. Kinect 1, again, comparison, starting out with 150 lux which would be like this
room and then increasing the illumination to be up to 5000 lux which would be a cloudy day.
We can obviously do better than the Kinect 1, and the take-home message here was that the
ambient light on the surface was about 10 times greater than the laser, so basically
you couldn't see the laser point in these images because the laser is so much dimmer than the
ambient light but still we were able to get a good 3-D scan despite that. This was our first
generation. This was actually using visible wavelengths, red light, without any color filtering. So
even though we're projecting narrowband illumination, we're not filtering for
that narrow bandwidth at the sensor, which we could use to essentially block a lot of
the ambient light getting to this sensor. That was our first generation. This was our second
generation. In this system we have our own custom-built laser scanner using galvanometers
instead of using this really cheap, efficient implementation from MicroVision that unfortunately
at this point is difficult to get hardware for. At this point we have switched over to infrared
wavelengths where ambient light sources tend to have lower intensity, and we have also used a
color filter now to block all wavelengths except for the narrow band that we're
projecting out. This is probably about as good as you could do. This would be similar to
the tricks that are being played inside of the Kinect 1 and 2 systems to block out ambient light.
In this case we're showing you, so this light bulb was on. On the surface of the light bulb it's
50,000 lux, which is quite bright. It's equivalent to about the illumination a human
face would get in direct sunlight at about noon. It's about as bright as you can get outdoors.
And we're doing this in real time; we actually are getting a 3-D scan of this light bulb while the light
is on. So we're projecting very dim light onto the surface of this light bulb and, based on the
light that we're projecting onto the bulb, we're able to compute this geometry despite
how much light is coming from the bulb.
>>: That's a fluorescent light, though, so it's not putting out a lot of IR at all.
>> Oliver Cossairt: This example here is visible in this case. The next example I'll show you is
infrared outdoors. You're right. That detail I glossed over. We started out doing this in visible.
The only difference here, this is still visible, this is 633 nanometers, but we have a narrow
bandwidth filter in front of it, and then the next one will be outdoors, switching over to infrared.
Now we're in infrared. This is the system here. This is outdoors, noon on a sunny day. The
illumination on the surface is 80 kilolux. We're about 4 meters away here and we're going to
scan a person. Here's a comparison. There's a field of view mismatch here between our 3-D
scanner and the Kinect 2; in this case we are comparing with the Kinect 2 because the Kinect 2 has much
better ambient light rejection than the Kinect 1. The field of view mismatch is just
really a practical issue between what focal length we had to use for the sensor and the
projector based on the scan angles of the specific scanning optics that we're using. The point is
that we're getting a high-quality depth map here even in this 80 kilolux illumination, and we can
see that clearly the Kinect 2 is failing. We can definitely do better
than the Kinect 2. We can engineer this even further, but this is essentially where the state-of-the-art is for us right now.
>>: In the one in the middle, when you watch the video, when the person stops moving the
hand and his hand is closed, the camera can see how sharp the silhouette of the hand is. Whereas,
when the hand is moving you can see that it's noisy. This [indiscernible]
>> Oliver Cossairt: I think you are seeing basically motion blur. It looks weird because the
sensing mechanism is totally different than a conventional sensor. It's the effect of motion
during essentially single scan periods. Here I think this one was at a lower frame rate. I can't
remember the technical reasons why we had to go with the lower frame rate back here,
but I think it was 10 Hz. With
>>: The galvanometer is probably [indiscernible]
>> Oliver Cossairt: Yeah, I think that's right. The scan speed limitation with the galvanometers
over the field of view that we want is just a trade-off that we are stuck with for this prototype.
Here's a first stab at a comparison between doing a book, that V groove type of an object
where the light you project is going to bounce around multiple times before it gets back to
the sensor and produces serious errors for the gray coding system, and then we're able to at
least more faithfully reconstruct the shape of that V there. Here's another example of a
highly reflective object. It's a metallic sphere and we are able to get the sphere pretty
faithfully whereas the gray coding method fails.
>>: You mentioned earlier that if the reflectance is not a perfect specular mirror you can choose a
threshold that would give you the correct depth. Do you have an algorithm for choosing that
dynamically or do you tweak that by hand?
>> Oliver Cossairt: Right now it's all by hand. Basically right now whenever we set up an
experiment we tune the parameters specifically for that experiment, so that's definitely the
next step: flexibility in the sensor operation and algorithmic programming. None of this has
to be hardcoded. Like you're saying, it can operate on the fly and adapt to different
conditions. That's a good point. I spent a lot of time on that. I'm supposed to end at 11:30,
right? I'm going to give you a really brief overview of these other two projects. I think they are
cool projects and they give you a sense that we are really trying to work on a lot of different
stuff. The point here is there are all these prints at the Art Institute of Chicago that they don't
know how they are made. The reason is because there is this artist Paul Gauguin. Some of his
paintings are the most expensive paintings in the world at this point. We're looking at his
prints, not his paintings. He has this whole other body of work and the process that he used is
really not understood. The Art Institute has basically picked some prints that they wanted to
study carefully. This is an example of one that we looked at closely, this Nativity. This is the
front of the print and the back of the print. This was essentially what was hypothesized as the
printing process that was used. It's a standard monotype process. I'm not going to go over the
details. But that hypothesis had some weaknesses. And the main reason is because when I
looked closely at these prints, you could see what appeared to be these broken lines in the
prints. So what they were basically interested in figuring out here is could these broken lines
originate from indentations in the print that were caused by him essentially tracing something
on top of it and then the surface pressure transferring from the top to the bottom, the way that
you would transfer a secret message from the top Post-it to the one below it, that
sort of thing. Blind incisions is what they called them. They thought that was the origin of
these but were not sure how to verify it. I'm going to keep going here. They asked us to make
some 3-D surface measurements of the prints to try to verify if this was the case. These are
Museum conservators, so they don't have access to anything high-tech at all. They have access
to cameras and lighting equipment and standard photographic equipment. The approach we
chose to go with was photometric stereo because that's something they can actually implement
in their lab there. What that means is we're capturing a sequence of photographs of the scene
from a fixed camera viewpoint and changing the illumination. This is the data that we are
capturing. Here's the print. We are changing the light source around; from this reflective
ball here you can figure out the angle of the light source coming in. And then based on the
angle of the light source coming in and the brightness that you measure for each of those
directions, given some basic assumptions or understanding about the material properties in the
scene, you can back-calculate 3-D surface information and use that to
essentially separate surface color from the underlying 3-D surface shape. That's
what they're interested in here and that's what we were helping them out with. There were
two main questions that we were able to answer for them in this process. The first was the
origination of these lines, so you can see as these lines are pulled away, you can see very clear
surface perturbations, so hills coming out of the page. The height of these perturbations we
are seeing is about 100 microns. That was essentially strong enough evidence
for them to indicate that the transfer process was such that the print was placed down,
pressure was applied to the back, which caused indentations on the front, which was the way
the ink was transferred from an ink support surface to the print. That's the first piece of
evidence that we were able to get them. The second piece of evidence was that all of these
locations where it looked like you had these broken lines, we were able to verify that despite
the fact that we know we can measure accurate 3-D surface shapes for these prints, there are
no surface features that correspond to any of these broken lines. That's what we were able to
do for them. What it allowed them to come away with is this idea that the way he made
these prints was he took ink surfaces that were tarnished to begin with. This is basically the
process that we were able to come up with for how he made these prints. Start off making a
print, he's pushing on the back. Ink is getting applied from the ink surface to the front of the
piece of paper and then you pull up the print and you move it away and then this is now the
beginning of the next print which is the one that we were analyzing. Now we put a piece of
paper down. We draw on the back. Ink gets transferred from the ink surface to the front.
However, now, because the ink surface was tarnished, there are already lines where ink was
removed, so there are broken lines that appear in the transferred ink. Then the process
continues for more ink layers, and this is the mock-up comparison between the front and the
back. Our reconstruction versus the ground truth and they were very happy with this process.
They thought that this was a pretty good explanation for the process that Paul Gauguin
probably used. Last one, I'll just do this quickly. This is a totally different flavor here. This is
actually about x-ray imaging. Although it's related to a lot of other problems, specifically, we
are interested in very high resolution x-ray imaging. The idea here is that you are using x-rays
that have a very small wavelength, and as a result, when you have your imaging set up
appropriately, you can resolve features on the order of the wavelength, which can be
as small as tens of nanometers or even single-nanometer resolution. Specifically, what has been a really hot
topic in the last couple of years is biological samples. Looking at biological samples at the
subcellular level has been a new phenomenon that has occurred because of these very high-power
x-ray sources that have come out of these national labs that are repurposing now defunct
physics experiments, particle accelerators, and using them as essentially high-power light
sources that anyone can apply to to use for their experiments. That's where a lot of biology
experiments are now flocking to these national instruments to use these high-powered x-ray
sources. Biological imaging is one of the great things. The issue here is that, the way these
experiments work, you can't bend x-rays around the way you can optical frequencies. It's hard to
build a lens. It's expensive to build a lens. It's possible to build a lens, but it's very hard to make
things that refract x-rays, and building reflective optics is expensive for x-rays. The
by far easiest way to do it is you just shine light onto your object. That light, remember, is
very small wavelength light; it diffracts, and the diffraction pattern goes directly to a sensor.
That's the whole thing; it's completely lensless. Send light at the object, it scatters, and you measure
the intensity. The problem with these systems is that most of the light doesn't get scattered. You
have this large concentration of light at the center of the beam and you get a dynamic range
problem. This is because these are essentially natural images here and natural images, we
know, are very sharply peaked in the center in the low-frequency region and that's what's
happening and they have a very, very large dynamic range. We're sensing the amplitude
pattern of the Fourier transform of this image here. That's what we're trying to do here. It's
problematic and it produces all sorts of issues with the sensor. What's typically done, to avoid
blooming problems with the sensors, is they just block it out. They literally put a physical absorber
block in the center of the sensor. Just don't sense that light at all. One solution is to do HDR
just as we all know, capture multiple images with different exposures. This is done but it's slow
and there are certainly applications where they would like to do this faster and not have to wait
around to capture multiple exposures to make it work. Single shot imaging, you put an x-ray
beam on your object. It scatters. You essentially measure the amplitude of the Fourier
transform. It has a very large dynamic range, so you throw away the center. That was the
problem that we're looking at here. I'll just skip ahead to the results and find the most important
points in 3 minutes.
>>: You can take a little bit more time.
>> Oliver Cossairt: Okay. I'll walk through it briefly. It will just be slightly more than 3 minutes.
I'll explain it properly. This is the model. Our sensing model is we take our image of our
specimen x. We scatter it. If the scattering is far enough away it essentially boils down to
taking a Fourier transform, and of that Fourier transform we actually only measure the modulus.
But given some special tricks we can play, if we just take the inverse
Fourier transform we can get the object back. We need the phase in the Fourier transform
which we are not measuring, but there are tricks that you can play with that. The difference
here is that we're just going to have a high pass filter here. The point is that we're blocking all
of the DC frequencies, so if we do just naïve inverse Fourier transform processing, it's
equivalent to taking our image and applying a high pass filter to it. All we are doing here is we
are just applying some regularization to this problem. A very simple solution obvious to
anybody who has done some image processing or played around with some sort of
computational photography tricks, just happens to be something that hasn't been looked at in
this x-ray community, which is basically mostly physics stuff. We're using a total variation
regularization to solve for this problem here. This is the experimental setup. You have your
object here, and this is represented by Lena's eye. And then you have a couple of dots out here, and
the idea is that you measure the intensity in the Fourier plane, actually the squared magnitude. And
then that's equivalent to if you took an inverse Fourier transform, that's equivalent to taking
this image and measuring its autocorrelation. When you look at its autocorrelation the
autocorrelation looks like this, and the point is that because you had a couple of dots out here,
the autocorrelation is going to give you copies of the image that you care about out here.
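A small numerical sketch of the Fourier transform holography picture he is describing: an object plus two reference dots, an intensity-only Fourier measurement, and an inverse transform whose autocorrelation contains shifted copies of the object. The sizes, positions, and masked region are arbitrary choices for illustration:

```python
import numpy as np

# Build a tiny "specimen" plus two reference dots, well separated from it.
N = 256
field = np.zeros((N, N))
field[120:136, 120:136] = np.random.rand(16, 16)   # the object (stand-in for Lena's eye)
field[20, 20] = 1.0                                # reference dot 1
field[236, 236] = 1.0                              # reference dot 2

# The detector measures only the squared magnitude of the Fourier transform (no phase).
intensity = np.abs(np.fft.fft2(field)) ** 2

# Optionally mimic the blocked DC / low frequencies (the missing-data problem).
mask = np.ones((N, N))
mask[:4, :4] = 0.0
mask[:4, -4:] = 0.0
mask[-4:, :4] = 0.0
mask[-4:, -4:] = 0.0        # lowest spatial frequencies sit at the array corners for fft2
measured = intensity * mask

# Inverse transforming the intensity gives the autocorrelation of the field: the cross
# terms with the reference dots are shifted copies of the object itself.
autocorr = np.fft.ifft2(measured).real
print(autocorr.shape)       # with full data the copies are clean; with the mask a
                            # regularizer (e.g. total variation) is needed to recover them
```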
That's the basic idea for how to solve this problem of you're only measuring intensity out here
but even though you don't have the phase, you're able to sort of get the equivalent of getting
the phase back just using simple Fourier transform, inverse Fourier transform and that sort of
thing. But it doesn't work when you have missing data. It just totally fails: when you essentially
just apply a high pass filter to the image that we captured, you don't get a good
reconstruction. However, if you regularize then you do get a good reconstruction for the
natural images. This part in the center is essentially the autocorrelation of this image. You
can't reconstruct that well because it doesn't obey the statistics of natural images, but you
don't care. All you care about is getting these images out here and it seems to work very well
for at least the initial experiments that we've done. This is a real experiment in visible
comparison between a conventional reconstruction and the regularized reconstruction, just very
basic images at this point. We're now working with some physicists at the Argonne National
Laboratory to get real biological data to test this out on. There are some comparisons that you
can make: two different types of biological specimens and a resolution target. This would be a typical
result that they would get out of their algorithms and then this is where we are able to push the
results using a simple twist on the algorithmic processing. That's it. That's just the flavor of the
projects that we work on. The MC3D scanning system is about doing high-quality active triangulation-based 3-D scanning outdoors; that's where we are trying to push it. We have a lot of
collaborations with cultural heritage institutions and universities around the world where we are
trying to do computational imaging, 3-D imaging, and take it into interesting places; specifically,
inexpensive implementations are what we are interested in in this area of things. And then we are
also interested in these emerging imaging applications in different scientific fields. [applause].
>> Sing Bing Kang: Any more questions for our speaker?
>>: For the last thing where you had the two dots on the side on either corner there, was that
like making something like a hologram?
>> Oliver Cossairt: It's exactly a hologram. It's called Fourier transform holography.
>>: You had essentially two reference sources which is interesting.
>> Oliver Cossairt: Yeah. It doesn't have to be that way. People played around with all sorts of
different types of reference objects, even using multiple points together. One interesting
implementation is what basically boils down to the contrast in the interference between the light
diffracted from the point source and the light diffracted from the object. You want more light,
and you can't put a beam splitter in the setup to make more light go
through the pinhole, so ideally you would like to not attenuate as much light and just use a
circular aperture. A square aperture has been done before, because basically, if you apply two
derivatives, horizontal and then vertical, you end up with four delta functions at the corners and
so you can get higher-contrast fringes and then get back to the same image that you would've
had if you had had two dots, that sort of thing. What I didn't really get a chance to explain is that in
practice this is just one of the techniques that they use; it's holography. What they really like to
do is no interference at all. Just take the diffraction pattern from the object and just measure
the amplitude of its Fourier transform, and there is a series of these so-called iterative,
nondeterministic phase retrieval techniques that are used. The technique that we use
also works in that case too. It's called coherent diffraction imaging and that's the more
common way to actually solve this problem. The alternative is you actually have to make a
piece of hardware that has those little dots in it and then you need to stick your object in a
known position relative to those little dots and this is nanometer scale. It can be really
challenging. Taking the pinhole out of the equation essentially makes for simpler experimental
setups, more convenient setups, and that's the regime that it is more often used in.
>>: Is that hard to scale, two beams [indiscernible] for example?
>> Oliver Cossairt: Yeah. It's a pain in the butt. I think any type of optical component for
x-rays is really a pain in the butt to work with. That's my understanding of the situation. Beam
splitters, focusing optics, all of these things, they do exist and people do use them, but they
seriously add to the complication of the experimental setup. So if they can avoid using them they
like to.
>> Sing Bing Kang: Any more questions? I just have one. How do you know that the results
you show here are reasonably accurate given that A the surface is not [indiscernible] and B
there is a lot of [indiscernible]?
>> Oliver Cossairt: That's basically what we are looking at now. The goal right now is to
actually take ground truth measurements using high precision 3-D surface measurement
equipment. We're going to take a white light interferometer, measure to micron scale the
height maps of the set of representative surfaces that have been produced by the conservators
that have reflectances similar to materials in the actual artworks that we care about. Make
ground truth measurements, do comparisons between different photometric stereo
reconstruction algorithms, different assumptions about using libraries of materials, and that
sort of thing. That's essentially what we're looking at now. For this initial work all we did was
assume a Lambertian surface, and the results were informative to them even with that simple assumption:
straight photometric stereo from the 1980 Woodham paper, that's all we're doing, but that was
useful to them. Now we're starting to look at it in a little bit more detail to see how important
it is to see what accuracy is required out of these things.
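For reference, a minimal sketch of the classic Lambertian photometric stereo he cites (Woodham, 1980): given calibrated light directions and per-pixel intensities, solve a least-squares system for the surface normal and albedo at each pixel. This is illustrative, not the exact pipeline used on the prints:

```python
import numpy as np

def photometric_stereo(images: np.ndarray, lights: np.ndarray):
    """Classic Lambertian photometric stereo (Woodham, 1980).

    images: (K, H, W) intensities under K known directional lights.
    lights: (K, 3) unit light directions (e.g. from the calibration sphere).
    Returns per-pixel unit normals (H, W, 3) and albedo (H, W).
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                       # (K, H*W)
    # Lambertian model: I = lights @ (albedo * normal); solve in least squares.
    G, *_ = np.linalg.lstsq(lights, I, rcond=None)  # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = (G / np.maximum(albedo, 1e-8)).T.reshape(H, W, 3)
    return normals, albedo.reshape(H, W)

# Toy check: a flat surface facing the camera, lit from three directions.
lights = np.array([[0.0, 0.0, 1.0],
                   [0.5, 0.0, 0.866],
                   [0.0, 0.5, 0.866]])
true_normal = np.array([0.0, 0.0, 1.0])
images = (lights @ true_normal).reshape(-1, 1, 1) * np.ones((3, 4, 4))
normals, albedo = photometric_stereo(images, lights)
print(normals[0, 0], albedo[0, 0])                  # ~[0, 0, 1], ~1.0
```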
>> Sing Bing Kang: Let's thank the speaker once more. [applause]