>> Cha Zhang: Okay. Good morning, everyone. It's my great pleasure to have
Professor Minh Do here. Professor Do is an associate professor from the
Department of Electrical and Computer Engineering at UIUC. He has received
numerous awards, including a CAREER Award from NSF and the IEEE Young Author
Best Paper Award in 2008. He was named a Beckman Fellow at the Center for
Advanced Study, UIUC, in '06 and received a Xerox Award for Faculty Research
from the College of Engineering. He's also a cofounder and the CTO of Nuvixa Corporation,
a spinoff from UIUC, to commercialize depth-based visual communication.
So without further ado, let's welcome Professor Do.
>> Minh N. Do: Thank you very much. It is my great pleasure and honor to be
here. It's my first time visiting Microsoft Research. I have always admired a lot
of the work coming out of this place, and it's a great pleasure to connect with some
old friends as well as meet some people I only know through papers and
publications.
And it certainly resonates a lot with the recent activities coming out of
Microsoft. I think Microsoft is now truly the center of the universe
surrounding depth sensing and how to use it for immersive visual
communication. So, again, it is a great time to be here, and I hope we can start
up some collaboration and some discussion afterward.
So a bit about myself -- I've been working at the University of Illinois for almost
ten years now, and my original background was in developing high-dimensional
wavelets for image representation. So, you know, you can tackle problems
like compression, de-noising, and so on.
And about five years ago I got very interested in depth cameras, a
technology that can provide, in addition to the typical color images that we as
image processors deal with, a very different type of visual
information, namely depth. A very interesting question to me
is how we can develop techniques to exploit that new type of data,
in particular for visual communication applications.
So let's first start with -- this is something that we dreamed up in 2005, a
vision of whether we can extend beyond television, beyond typical visual
communication where we only have a single fixed camera recording the event
and the people, and the viewer just passively stays still and watches the
content.
But can we -- for example, like being here in person: we move around and
talk to people. So the freedom of not just having a fixed camera, but being able to walk
around and see the information in, you know, 3D space and
[inaudible]. I think that is certainly a very big freedom that we can add.
So what does it mean? Here is a picture actually drawn by somebody here
in the audience [inaudible], together with my former Ph.D. student at UIUC who is now
working at Microsoft. The vision we were very excited about at that time is that
we can have multiple cameras recording a dynamic scene, and then
we would do a lot of very interesting advanced signal and image processing on that
[inaudible] information to get a very efficient representation, and then transmit
that across a network or store it, so that later on we can view the
information -- individual persons or different, you know, sessions -- at
different viewing angles, different perspectives, different positions.
So that was the dream. And that is one setup, but as you can see, it can
manifest itself in many different applications. For example, how can we
generate a stereoscopic stream, right? We record a few views, and
now people are looking at autostereoscopic displays for which you have to
synthesize multiple views, and we only have a limited set of recordings, so now I
have to synthesize those views. So that's one.
Also, you know, if you want to take the person out, extract them and change the
viewpoint slightly, for example, to do some eye-gaze correction -- all of these
problems, to me, are fundamentally problems of view synthesis using live
recorded data, and that is going to be an exciting problem.
So the vision which, again, got us very excited is that existing
audiovisual communication still only uses a single camera, very little processing,
and the viewer just sits there passively.
As we look into the future, the cameras and sensors are getting very, very cheap,
and we have a lot of computing power and bandwidth available to us. How can
we provide a new viewing experience? I think that's a very exciting research
problem.
Why? Because in the middle of that is how we can develop new signal
processing theory and algorithms to capture this kind of new setting and
deliver a new experience.
So shortly after we decided on this problem -- and this is why it has become such a
very exciting time for us -- there's a new kid on the block that provides not just a
regular camera with color information, but a new type of sensing
capability that can measure depth information at video rate, in realtime.
So that provides us additional information, and devices like, you know,
SoftKinetic's are going to enrich the consumer space at a low cost. So, again, how
can we exploit this to deliver this visual communication experience?
And, again, I'm fully aware that it's also something that's very actively researched
and done here at Microsoft Research. So I'm coming here to learn and look
forward to new opportunities to collaborate.
So let's first look at -- again, my background is signal processing, so first we look
at the image, we look at the signal. So here is one example from
a depth camera. We have one of these -- you know, when we started several
years ago it was very expensive, like a $10,000 PMD camera. Now we're so
excited that with the same amount of money we can get a hundred of those
Kinect cameras and play around with them.
But the fundamental problem there is that the depth information tends to be noisy,
has lower resolution, and there are some occlusion problems because
of the way the data is measured.
So zooming in, again, you see those very important problems there.
So, again, as someone who has been working on image and video processing,
dealing a lot with regular color images, this is a very different type of data, and I
wondered, okay, what processing algorithms can we develop
here.
So now go back to the vision that we set out earlier. Can we generate multiple
viewpoints in free-viewpoint video based on some fixed recordings?
So let's make things concrete. I'm going to start, you know, studying and looking
at the problem in the following setting. We have a fixed set of cameras:
two color cameras, and in the middle we put a depth camera, and we want to
record a 3D environment, right, a dynamic 3D scene.
And then based on this recording -- we want to use a small number of cameras so
that it can be very efficiently transmitted over the internet or stored -- we
would like to synthesize an arbitrary view, and, you know, if we
have time we can freely move around here, or you can
synthesize them as a stereo pair. Then we can look at the environment in 3D at any
viewing angle. And, you know, we want to do it fast.
So that is the concrete setting, and when we look at this problem -- again, we
got [inaudible] -- what are the key challenges?
One is that the depth cameras we dealt with at that time tend to have very
low resolution. So how can we couple that with the available
color images, which have high quality and high resolution? Combining the two is
going to be one key issue.
And, you know, we wanted a setting where cameras can be dropped in very
easily, without very complicated calibration techniques, so the user doesn't have
to -- you know, the cameras don't have to be part of an integrated system,
so that, you know, it can [inaudible].
And the next one is that at that time, you know, parallel computing was
becoming mainstream, so the question is how we can deploy these kinds of
algorithms on a parallel platform so that we can get, you know, a realtime,
high speed-up.
Okay. So let me jump in and explain our algorithm, which borrowed ideas from
many different techniques. It's just our attempt to develop and deliver some of
this free-viewpoint experience, and then to set that up as a framework to analyze
the quality of the reconstructed images.
Okay. So, again, here is the setting. We have one depth camera in the
middle and two color cameras. The algorithm is, you know, very simple. First,
because we have the depth information, we can go back and
propagate the depth from this middle camera to these two color
cameras.
Now, after we do that -- again, the simple reason why we did that is because
now at those color camera positions we have both color and depth, and
because we have that, we can exploit the rich, high-resolution, high-quality
color image to help sort out the problems with occlusion and low resolution in
these [inaudible] depth images.
And then based on that we do processing and produce a very high quality depth
image here. So, again, the key point is how to combine depth with color.
And once we have this per-pixel depth information at those color cameras,
then we can, you know, warp it by projection here and render the desired virtual
viewpoint.
Okay. So those are the three steps. And, again, we purposely thought about how to
map the whole thing onto a GPU so that we can utilize all the hardware
acceleration the GPU provides.
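To make that first propagation step concrete, here is a minimal sketch of warping a
depth map from the depth camera into a color camera's viewpoint, assuming a pinhole
model with known intrinsics K_d, K_c and relative pose (R, t); the function and
parameter names are illustrative, not the actual implementation from the talk.

```python
import numpy as np

def propagate_depth(depth, K_d, K_c, R, t, out_shape):
    """Warp a depth map from the depth camera into a color camera's view.

    depth     : HxW depth map from the depth camera (0 = no measurement)
    K_d, K_c  : 3x3 intrinsics of the depth and color cameras (assumed known)
    R, t      : rotation (3x3) and translation (3,) from depth to color frame
    out_shape : (H, W) of the color image
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    z = depth[valid]

    # Back-project the measured depth pixels to 3D points in the depth-camera frame.
    pix = np.stack([u[valid] * z, v[valid] * z, z])       # 3xN homogeneous pixels
    pts = np.linalg.inv(K_d) @ pix                        # 3D points

    # Transform into the color-camera frame and project.
    pts_c = R @ pts + t[:, None]
    proj = K_c @ pts_c
    uc = np.round(proj[0] / proj[2]).astype(int)
    vc = np.round(proj[1] / proj[2]).astype(int)
    zc = proj[2]

    # Scatter into a sparse depth map at the color viewpoint, keeping the
    # nearest surface when several points land on the same pixel.
    sparse = np.full(out_shape, np.inf)
    inside = (uc >= 0) & (uc < out_shape[1]) & (vc >= 0) & (vc < out_shape[0]) & (zc > 0)
    for x, y, d in zip(uc[inside], vc[inside], zc[inside]):
        if d < sparse[y, x]:
            sparse[y, x] = d
    sparse[np.isinf(sparse)] = 0.0    # zeros mark holes for the later steps
    return sparse
```

The result is exactly the sparse, hole-ridden depth map described next; the
occlusion removal and filling steps operate on it.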
Okay. So, again, my background is in signal processing, so I treat this first as
a processing problem. Let me walk through the sequence of processing
steps we developed to combine the depth and color into a high-resolution,
enhanced depth map.
So first, if you remember, the first step is we take the depth from a fixed viewpoint
and we warp it to the color viewpoint, and the depth we use is very low
resolution, so you can hardly see it -- what you have is a very sparse set of
[inaudible].
And the points are color coded here so that they show, you know -- I think
the darker the color, the closer the object, and vice-versa. But then, you know,
because of those points we know which object is closer than
the other, so we can eliminate those -- you know, the white points here have
to be occluded by those darker points -- so we simply do this occlusion
removal and now we have a more accurate set of [inaudible] representing the
3D scene in front of us.
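As a rough illustration of that occlusion-removal step, the sketch below discards
warped samples that are much farther than the nearest sample in a small
neighborhood; the window size and threshold are guesses, not values from the talk.

```python
import numpy as np

def remove_occluded_points(sparse_depth, win=5, ratio=1.15):
    """Drop warped depth samples that should be hidden by nearer surfaces.

    sparse_depth : HxW sparse depth at the color viewpoint (0 = empty pixel)
    win          : neighborhood size used to find the nearest local surface
    ratio        : a sample is discarded if it is this much farther than the
                   closest sample in its neighborhood (threshold is a guess)
    """
    r = win // 2
    cleaned = sparse_depth.copy()
    ys, xs = np.nonzero(sparse_depth)
    for y, x in zip(ys, xs):
        patch = sparse_depth[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        nearest = patch[patch > 0].min()
        # A far ("background") sample surrounded by much nearer samples is
        # assumed to be seen through the sparse foreground surface: remove it.
        if sparse_depth[y, x] > ratio * nearest:
            cleaned[y, x] = 0.0
    return cleaned
```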
But, still, it's very poor quality. So now we combine depth with color -- that's where
the power comes in -- using one of the very popular techniques; again, we picked
this one because we can map it to the GPU very fast, and we'll show some
numbers later. We quickly get an interpolated depth map, you know, with the holes
filled in and higher resolution.
But, again, you see that a lot of problems are still here.
Again, you know, I was trained as an image processor, so to me the best way to
explain the algorithm is to go through the sequence of intermediate results; for
the details of those steps I can show the references later and you can
look into them, but let me just go through these images and try to deliver the point.
So the next step: we still have those big holes in the middle because of
the occlusion, and obviously when we draw them like this [inaudible] we know that
this part has to belong to the background. So we know that it is
background, and using just some simple 1D interpolations -- sorry,
extrapolations -- we can look at the [inaudible], extend them out,
and fill them. So that's another step. And now we're going to fill them in.
Without doing that -- sorry. Simply filling this hole with a plain
interpolation would just smear things out, right? But with a little bit of
understanding about the 3D object underlying the scene, because now we
know the depth information, we can do a better job in filling these occluded
areas.
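A minimal sketch of that background-directed hole filling, assuming holes are marked
with zeros and that larger depth values mean farther surfaces; the scan-line logic is
illustrative only, not the exact rule used in the talk.

```python
import numpy as np

def fill_disocclusions(depth):
    """Fill disocclusion holes row by row by extending the background side.

    Holes (depth == 0) created by the viewpoint change are assumed to belong
    to the background, so each hole is filled with the depth of whichever hole
    border is farther away, instead of interpolating across the gap (which
    would smear foreground into background).
    """
    filled = depth.copy()
    h, w = depth.shape
    for y in range(h):
        row = filled[y]                      # view: writes modify `filled`
        x = 0
        while x < w:
            if row[x] == 0:
                start = x
                while x < w and row[x] == 0:
                    x += 1
                left = row[start - 1] if start > 0 else 0.0
                right = row[x] if x < w else 0.0
                # The farther (larger-depth) neighbor is the background; if one
                # side is missing, the other one is used.
                row[start:x] = max(left, right)
            else:
                x += 1
    return filled
```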
Okay. Now, after that, one thing we learned is that for the depth image, from the
frontal view the boundary looks fine, but as soon as we move away from that view,
there's a term people now call flying pixels, which are pixels that, you know, sit
between the foreground and background and, when the depth is measured,
get averaged -- they lie along this line here, and they are very poor
quality.
So when you warp them over, these show up, and we had to
develop something that can clean this up. And if you look at this -- when, you
know, my students first showed me this, I thought, how can we clean this up? It's
a really hard problem.
But then we realized that if we have the color information next to us, because we
have a very nice color image here, then we know exactly, at that viewpoint, where
the exact edge is, so we can do this processing to eliminate those in-between
pixels, deciding which belong to the background and which to the foreground.
So with that, we can use some, you know, very simple tricks -- some very
simple and efficient techniques -- that give us this clean image.
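The flying-pixel cleanup could look roughly like the sketch below, which snaps
ambiguous boundary depths to either the local foreground or background depth
according to the most color-similar neighbor; the thresholds, window size (assumed
odd), and function name are illustrative, not the ones used in the talk.

```python
import numpy as np

def clean_flying_pixels(depth, color, spread_thresh=0.2, win=3):
    """Snap 'flying' depth pixels near boundaries to foreground or background.

    In a small window around each pixel we look at the nearest and farthest
    depths; if the spread is large we are on an object boundary, and the center
    depth is replaced by whichever extreme is consistent with the neighbor
    whose color is most similar to the center color.
    """
    h, w = depth.shape
    r = win // 2
    out = depth.copy()
    for y in range(r, h - r):
        for x in range(r, w - r):
            dwin = depth[y - r:y + r + 1, x - r:x + r + 1].ravel()
            cwin = color[y - r:y + r + 1, x - r:x + r + 1].reshape(-1, 3).astype(float)
            near, far = dwin.min(), dwin.max()
            if far - near < spread_thresh * far:
                continue                        # flat region: nothing to fix
            center_color = color[y, x].astype(float)
            dist = np.linalg.norm(cwin - center_color, axis=1)
            dist[len(dist) // 2] = np.inf       # ignore the center pixel itself
            k = int(np.argmin(dist))            # most color-similar neighbor
            # Snap to foreground or background, whichever that neighbor sits on.
            out[y, x] = near if abs(dwin[k] - near) < abs(dwin[k] - far) else far
    return out
```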
All right. So the example result -- again, it's not perfect because you can still see
some of these artifacts here, but what it shows is, you know, we take two color
images and the depth in the middle, which has poor resolution, and we can synthesize
a high quality depth map and then, based on that, render an image at a novel
viewpoint. Here is another perspective.
Okay. And here is a video where we can fly through and see the object. Again, we
can synthesize any view, so you can actually see it as -- you can [inaudible] a
stereo pair and see things in 3D.
And all of this can be done in realtime because -- I will show some numbers -- we
mapped it efficiently to the GPU. Some of my colleagues, in one class, worked on
this and they have those steps mapped onto the GPU.
Okay. As you can see, it's a [inaudible] of algorithms, and one thing that
we've been looking into is, well, can we quantify accurately the error that
results from this set of, you know, processing steps in synthesizing the final
image, and use that to give a prediction of, you know, what the best
configuration of the scene is, or how to -- you know, in compression, I have depth
and color, so how to best allocate bits between depth and color to deliver the final
high quality synthesized viewpoint.
So that is a problem that we would like to analyze. Of course now we switch gears
to try to analyze it, and we do a lot of simplification here so that, you
know, we still keep the problem tractable.
And we use as the framework the propagation algorithm I just described, and look at
a simplified version of that. So let me look at a very simple setting here first to
try to deliver the point.
So imagine that we have some, you know, surface, some object in 3D, and at
this viewpoint here, through propagation, we have a set of depths corresponding
to this camera's color pixels. So we have that set.
And now we want to synthesize a new viewpoint based on that set of
per-pixel depths. And we can set up -- sorry -- we can, you
know, develop the setting where there is a [inaudible] surface, we
throw in that the cameras have certain resolutions, and the depth and the texture of
the color image have certain accuracies. This, again, based
on my training, is: let's say we do coding at a certain bit rate, and
there's a certain distortion, so we can, you know, quantify that. And given
that, what is the best way to deliver a high quality image? A classical problem of,
you know, rate allocation.
So given the setting here, let me give you the gist of this
approach. Here is our actual image, one single actual image. And here's our
virtual viewpoint where we need to synthesize the color image.
What we do is we take this pixel; we know the depth, we know that, you know,
along this [inaudible] it varies, and then we hit the surface and we [inaudible]
it back here. So now we have a color pixel over here.
Now, what happens when this depth is noisy? So, you know, we would go
[inaudible] here. That noise is either due to noise from the measurement or
noise from quantization due to coding. It ends up somewhere here, and when
we warp, the pixel now moves over here.
So we can quantify what that [inaudible] is, right? It's a simple
geometric argument. You can [inaudible] that.
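For a rectified two-camera setup with baseline b and focal length f, that simple
geometric argument can be illustrated with the standard stereo relation below; this
is textbook geometry used only to illustrate the point, not the exact expression
from the talk.

```latex
% A depth error $\Delta z$ at depth $z$ shifts the warped pixel by roughly
% $\Delta u$, for a rectified pair with baseline $b$ and focal length $f$.
u = \frac{f\,b}{z}, \qquad
\Delta u \approx \left|\frac{\partial u}{\partial z}\right|\,\Delta z
         = \frac{f\,b}{z^{2}}\,\Delta z .
```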
The next thing is we look at the virtual viewpoint, and at the virtual viewpoint
we have a set of pixels here, and the ideal function we want to reconstruct
is here.
But instead of having the pixels here, due to the [inaudible] we have those pixels,
and then we have the measurements here.
So the noise comes in two parts. One is that the [inaudible] gives you the wrong
location due to the depth measurement, and the other one is due to the
color, so you have this. And given this new set of [inaudible], we just do a
simple interpolation, right? A simple interpolation.
Here we show linear interpolation, but the technique, you know -- let's
say you can pick your favorite [inaudible] interpolation, for example -- we can
bound that. And then we can bound the error of this reconstructed image
with respect to the underlying original image. So we can bound that.
The next thing is, remember that we don't just have one actual image, we
have a number of them, and each of them goes to the 3D scene and is propagated
back to the virtual viewpoint. So we have a collection of those samples, and we can
use a randomized argument to characterize the density of those samples at the
virtual viewpoint. It turns out that we can, you know, write that as a closed-form
formula. And putting all of those things together, we have a final bound --
we can characterize, with a very classical technique for random
point sets, the moments of the differences in the sampling intervals after
we take those points and warp them over here.
Okay. Even though these look very, you know, heavy, we can actually compute
those numbers exactly. So given that, we can put it all together and
find exactly what the error is when we do the reconstruction. There are a number of
terms here. So let me explain -- first, this one here is telling us how
smooth the texture of the scene is, right? So if the texture is very smooth, then the
error will be small.
The other one here is fully dependent on the configuration of the
cameras in the scene. And then there's an error due to the measurement of the
texture and the measurement of the depth, and they all [inaudible] into this.
And all of these terms we can work out in closed form.
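As an illustration only, the kind of decomposition being described could be written
in the schematic form below -- a texture-smoothness term scaled by the sampling
density, a depth-error term scaled by the camera and scene geometry, and a
texture-error term. The symbols are placeholders; the exact constants and exponents
are in the referenced papers, not here.

```latex
% Schematic form of the reconstruction-error characterization described in the
% talk. $\Delta$ is the sampling interval at the virtual view, $C_{\mathrm{tex}}$
% reflects texture smoothness, $C_{\mathrm{geom}}$ the camera/scene configuration,
% and $\sigma_t, \sigma_d$ the texture and depth error levels (placeholders only).
\mathbb{E}\!\left[\,\lvert \hat{I}_v - I_v \rvert^{2}\right]
 \;\lesssim\;
 \underbrace{C_{\mathrm{tex}}\,\Delta^{2}}_{\text{texture smoothness}\times\text{sampling}}
 \;+\;
 \underbrace{C_{\mathrm{geom}}\,\sigma_{d}^{2}}_{\text{depth error, scaled by geometry}}
 \;+\;
 \underbrace{\sigma_{t}^{2}}_{\text{texture error}} .
```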
Okay. Now, of course, the question is -- okay, actually I just explained exactly
what those terms are. So, you know, to recap: this term, you
know, encodes the geometry of the cameras -- what the scene is
and how we place the cameras.
This term here is telling us about the sampling densities of the pixels, either
color or depth. And, you know, the other terms encode the
accuracy of the depth and color measurements.
So, again, this gives us a predictive analysis, and we can get a sense of,
well, as the technology improves, how the
reconstruction error goes to [inaudible] as we get better and better accuracy.
So even though that is a very messy formula, in certain
cases we can find the exact closed form and plot it out. And, again, the
goal here, to keep the problem tractable, is to look at the
behavior of this setting.
So here is a scenario in which, you know, we have several cameras placed along a
line, and we want to synthesize the intermediate viewpoints. Then, for example,
our theoretical prediction is how the error is going to behave as we have more and
more samples, right? And we can synthesize the scenes -- we have everything
exactly set up -- then measure what the actual error tends to be, and we can
see that it follows this theoretical bound very nicely and, again, accurately
reflects what we predict.
For example, here the error decreases linearly as the number of samples
increases.
Of course, we can extend that to 3D, you know, and we can also
extend it to cases where there are certain occluded areas: we can note those
boundaries and add in some additional terms.
Okay. So the main idea here, I guess, is that we couple a particular
technique, a method to synthesize the viewpoint, with an analysis that quantifies
the accuracy of the reconstructed image based on the scene configuration and
the characteristics of the cameras provided to us.
Okay. So the next part -- oh, and as I promised, this is slightly out of order: the
timing. For example, this particular algorithm,
if we run it on the CPU, takes a lot of time, but it maps very
efficiently to the GPU because of the [inaudible] architecture, so we
get the saving; certain other algorithms, you know, would not give us that big a
saving.
Overall, with the several components put together, we can achieve realtime rendering.
Okay. So that was, you know, 2007 to 2009, some of
the initial algorithms, and if you noticed carefully, the data I showed earlier was
completely synthetic. Right? We have a 3D model, we synthesize what the
depth map has to be, and then
we do the reconstruction.
So now comes the real thing -- you know, the real thing came later, when these
cameras became available and we could afford to buy
some. So we went ahead and built this simple setup in our lab, with a
depth camera here and a number of color cameras here.
Somebody in the room knows this -- you constructed part of that setup;
there's some footprint here.
And then, you know, once we have these images, what we try to do here
is run the algorithm that I described earlier. Namely, I use as input I, D,
and B, and I want to synthesize what it would be at C, okay?
So everything is real, and we set it up and run the algorithm.
So, again, here's one of my students. There are three inputs: left and right color,
and depth in the middle here. Just to show the result quickly: left and right,
and here's the rendered view, and here's the ground truth.
Now, why is this an important problem? As you can see, the goal here is to
be able to track where the eye gaze could be, and then synthesize a
view in which the person looks directly into the camera. You can see both of these
original viewpoints are a little bit off, but then we can synthesize the person
[inaudible] on the camera.
You'll recognize certain artifacts here along those boundaries, and I will later show
where they come from and how we can deal with them.
>>: [inaudible]
>> Minh N. Do: Yes.
>>: [inaudible]
>> Minh N. Do: Okay. So everyone recognizes that the depth camera we used at
that time is very bad with [inaudible] reflections. Human hair tends to absorb a
lot of [inaudible], and I know the Kinect camera now has some [inaudible],
but, you know, for that camera, for example, with dense hair a lot of pixels get
missing, so that's the way we [inaudible].
All right. So, again, here is the sequence of steps and how we process them. We
propagate the depth, remove the occluded points, and after having
this very coarse point cloud we can synthesize a higher
quality depth map and fill in the holes by doing this occlusion fill-in and edge
enhancement. So we end up with much higher, you know, per-pixel depth
quality here, and then we synthesize.
Now, the nice thing is that because we have this, the parts
along the edge here we can easily remove, and now we
extract, you know, the person from the background. Some of those artifacts are
due to compression.
Okay. So that is how we handle a real image: we warp the actual depth maps
from an actual depth camera into, you know, the other
color viewpoints, and each of those color viewpoints then has full
per-pixel depth plus color.
And we strongly believe that that is a very efficient representation of the
data -- with this data we are now ready to transmit and send it or store
it somewhere so that viewers, either remotely or later in time, can quickly,
you know, view the scene from different viewing angles.
Okay. So now come to the next problem is how can we -- yes?
>>: [inaudible]
>> Minh N. Do: No. We assume that it's only, you know, one depth per pixel.
>>: [inaudible]
>> Minh N. Do: Yes. So the hole filling -- we realized -- let's say we don't have
one view here, but we have multiple views. And each view has a, you
know, per-pixel color and depth. So we --
>>: [inaudible]
>> Minh N. Do: We act like we have, you know, one single depth and multiple colors,
but then we propagate them through the processing, and now from one depth camera
we can synthesize multiple depth viewpoints. And if there is occlusion, we
have the color, so we can fill in those occluded areas.
>>: [inaudible]
>> Minh N. Do: So, okay, maybe let me try to understand the point here. When
we first propagate and [inaudible] them over, we have a lot of those
occluded areas, but -- let me clarify the setting here.
So the setting we have is one single depth camera, and
we have two color cameras at two viewpoints. First we use propagation and
get the two depth maps. And, of course, each of those has a lot of
occluded areas.
But the key thing is that at this viewpoint here we have color information,
and we use that information to guide how to fill those occluded areas, and then,
per color viewpoint here, we have its full depth, you know.
>>: [inaudible]
>> Minh N. Do: Each color, yes.
>>: [inaudible]
>> Minh N. Do: Oh, the view in between, it just -- yeah.
>>: [inaudible]
>> Minh N. Do: So we require that for the view in between, you know,
any pixel here has to be seen by at least one of these two cameras. So we limit
the freedom of what this virtual view can be.
>>: [inaudible]
>> Minh N. Do: Oh, yes. Certain kinds, let's say, have some, like, you know,
self-occluded areas, yes.
>>: [inaudible]
>> Minh N. Do: Oh, I see. I see. Yeah. Yeah.
>>: [inaudible]
>> Minh N. Do: Right. Right.
>>: [inaudible]
>> Minh N. Do: I see. I see.
>>: [inaudible]
>> Minh N. Do: I see. Yeah. We were thinking about, you know, having, like, a
simple concave surface here, so let's say we look from this
angle, certain parts here get occluded and we have here now the --
>>: You would really need to know, oh, this is a sphere
>> Minh N. Do: Sphere, yeah, yeah.
>>: [inaudible]
>> Minh N. Do: Or, you know, let's say you have an object, I guess, with a
convex surface, for example. Yeah. Yeah. Good point, yes.
So if it is some, like, you know, very convoluted concave surface, then, yeah, we
have those issues.
So, okay, I think that's a good point at which to recap, because now we assume that
we have multiple views, and for each of them we have full color plus per-pixel depth
after this kind of, you know, preprocessing.
And now we transmit this data, and then, you know, we let the user, on the
viewing side, simply, you know, do this viewpoint synthesis. Okay. And that
raises very challenging and interesting questions, which were tackled by Matthiew
here, who did his Ph.D. at UIUC and is now working at Microsoft.
So the key observation that Matthiew had here is: if we have a depth and a
color image, and assume that we do a post-processing -- sorry, pre-processing -- so
that we fill all the holes and we have a, you know, nice depth map here and color
here, how can we best jointly compress these two images?
Of course you can compress the depth and color separately, but how can
we do it jointly?
And the observation is, you know -- okay, for someone trained in wavelets and
[inaudible], most of the bits are spent on encoding the edges, the locations,
right, of the significant coefficients. And what the two images share in common
here is really that along these -- they have the same edges, right? And, again,
when we do this synthesis we really use the edges of the color to guide, you know,
how to sharpen the edges of the depth.
So, you know, they share the same, you know, locations. So we
want to use that. And I also want to show later that having explicit edge locations
here also helps the synthesized reconstruction, the synthesized viewpoint, the
view synthesis.
Okay. So given these two images -- or, you know, they can be a video sequence,
but let's say I consider coding them as an I-frame -- first we, you
know, detect the edge points, and once we have these edge points we only
store them once. So we code them using, like, chain coding. And then
[inaudible] we can efficiently encode the color and the depth given that edge
information.
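As a minimal sketch of storing the shared edge locations once, here is a generic
8-directional Freeman chain code -- a start point plus a sequence of direction
symbols per contour; this is the textbook idea, not the exact codec from the talk.

```python
# 8-connected neighbor offsets; each move along a contour is stored as a 3-bit symbol.
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_code(contour):
    """Encode an ordered list of 8-connected (row, col) edge points as (start, codes)."""
    codes = []
    for (r0, c0), (r1, c1) in zip(contour, contour[1:]):
        codes.append(DIRS.index((r1 - r0, c1 - c0)))
    return contour[0], codes

def chain_decode(start, codes):
    """Recover the contour from its start point and chain code."""
    pts = [start]
    for c in codes:
        dr, dc = DIRS[c]
        pts.append((pts[-1][0] + dr, pts[-1][1] + dc))
    return pts

# Example: a short edge segment round-trips through the chain code.
edge = [(10, 10), (10, 11), (9, 12), (9, 13)]
start, codes = chain_code(edge)
assert chain_decode(start, codes) == edge
```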
So the idea was very cleverly constructed by Matt. We look at the lifting scheme,
and lifting is an efficient way that, you know, people use in [inaudible] 2000.
You have a sequence of samples -- think about one scan line of the
image, so you have a set of pixels here. You partition them into odd and even,
and each odd sample can be predicted from the even samples, you know, from the two
nearby ones; through that prediction we subtract them, so those pixels
tend to become very small, and then we can, you know, iterate that for several
levels so that over here more of the coefficients are, you know, zero or small
and very few are going to be significant. So that's how we get the
energy compaction.
Now, the key trouble is that when we do that, typically we would lift the whole
scan line of the image, but now we only, you know, look at segments of the scan
line. So we go along here and here's a break point, and then another point here,
and the segment along here is very nicely approximated by a, you know, low-order
polynomial, so that with the lifting scheme a lot of the coefficients go to zero.
But when there is an edge point here, that is going to give a significant,
high-magnitude coefficient. So how can we eliminate that, given the knowledge
that we know exactly where the edge location is?
So the idea here is that when we do the lifting we locally extend that
segment, and these are the pixels that we have to insert. And the way that
[inaudible] simple -- with a model, a [inaudible]
polynomial, we can easily extrapolate those pixels and come up with a very nice
closed-form formula for them, and we can just insert them here; those pixels
are computed from these existing ones here. And based on that, a lot of the
other coefficients go to zero. No significant ones.
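Here is a small, toy illustration of one edge-aware predict step of a lifting scheme
on a scan line: in smooth segments the odd sample is predicted by averaging its even
neighbors, and next to a known edge location it is instead extrapolated from one side
only. This is a sketch of the idea under those assumptions, not the actual coder.

```python
import numpy as np

def lifting_predict(signal, edges):
    """One predict step on a scan line, avoiding prediction across known edges.

    signal : 1D array (one scan line of depth or color)
    edges  : indices where a new segment starts (the value changes at that index)
    Returns the detail (prediction-error) coefficients at the odd samples.
    """
    edges = set(edges)
    details = []
    for i in range(1, len(signal) - 1, 2):
        left, right = signal[i - 1], signal[i + 1]
        if i in edges or (i + 1) in edges:
            pred = left                  # edge to the right: extrapolate from the left side
        elif (i - 1) in edges:
            pred = right                 # edge to the left: extrapolate from the right side
        else:
            pred = 0.5 * (left + right)  # smooth region: usual average predictor
        details.append(signal[i] - pred)
    return np.array(details)

# On a piecewise-constant line with one edge, all details are zero, whereas the
# plain average predictor would leave a large coefficient at the edge.
line = np.array([2.0, 2.0, 2.0, 2.0, 7.0, 7.0, 7.0, 7.0, 7.0])
print(lifting_predict(line, edges=[4]))   # -> [0. 0. 0. 0.]
```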
So that was the key idea. And, you know, zooming into this image -- this, you know,
for [inaudible], would be an ideal image model to compress. Even so, you know,
at a very, very low bit rate we see a lot of these, you know,
compression artifacts.
Now, if we spend some bits -- take the overhead of coding the
boundary here -- then the remaining bits we spend on coding the coefficients,
and because of these exact edge locations we know a lot of coefficients
go to zero.
So with the same bit rate we get a much, much higher quality decoded image.
Okay. Now, what does it mean? Well, taking this -- so here's the zoomed-in
version here -- what it means is taking this per-pixel depth to a different
viewpoint, and now we synthesize a novel view. Because of all these nasty
edge points here, when we synthesize it creates those visual artifacts.
Now, with the clean edges that we can encode efficiently, the synthesized
viewpoint has much better quality.
So, again, the coding gain turns into a gain in visual quality.
Okay. So let me come to the final, you know, project that we've been working on.
We recognized that certain depth cameras -- this is one of these Mesa SwissRanger
cameras [inaudible], from the pre-Kinect era -- have very poor quality, poor
resolution, but, you know, color cameras are cheap; we can get an HD color
camera. We stick them together and the question is how we can improve the
depth measurement from this.
So you can see that, you know, if we have a Kinect, that is exactly the
same setup: a color camera next to a depth camera, fully registered. How can we
now provide an enhanced depth image?
So let me just go to the algorithm. The algorithm is very simple. We tried many
different approaches, and, you know, this problem has also been looked at by many,
many other research groups.
The one that we found works best is what we call joint global mode filtering.
The idea is the following: assume that at pixel p here we
have a color value and a depth value, and we want to enhance them.
So there is a guidance signal -- the depth and the color values here -- and this
function g here is just a Gaussian, so it's localized, you know -- if this
difference here is close to zero, it has a high value, and if, you know, they have
a lot of difference, then it's a small value. So it gives us a sense of, like -- if
you work on bilateral filtering, you know, this is the kind of thing you see pop up
everywhere.
And what we base the [inaudible] on is, you know, how similar these two guidance
signals are, how similar in location these pixels are, and then we look at this
value. Then we build this histogram, and we just find the max value.
All right. So maybe I can explain with the figure here, just to give a sense that
the algorithm is actually very simple.
So imagine that I have a noisy or low resolution depth map, and I want to know
what the depth at this pixel p here has to be. That pixel p could be
one of the existing locations or one that we don't know and want to upsample.
And we simply look through a window around it, and for those pixels we know -- you
know, they have a known depth and color -- and then we, you
know, move them over. So if this one has a very similar color or is very close by,
then, you know, collectively they get a high weight, and so we have
that weight.
And then, you know, we do the same thing with the other pixels, and, you
know, collectively we build the histogram for that: we just sum them up
and we have a function, you know, using those weights, and then
we just pick out the max. So that's the simple idea.
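A minimal sketch of that weighted-histogram mode idea, with illustrative Gaussian
weights on the color difference and the spatial distance; the function name and all
parameter values here are assumptions, not those used in the talk.

```python
import numpy as np

def joint_mode_filter(depth, color, p, win=7, bins=64,
                      sigma_c=10.0, sigma_s=3.0, d_max=255.0):
    """Estimate the depth at pixel p as the mode of a weighted depth histogram.

    Neighbors inside a window vote for their own depth value, weighted by a
    Gaussian of their color difference to p and of their spatial distance; the
    enhanced depth is the histogram bin with the largest total weight.
    """
    h, w = depth.shape
    py, px = p
    r = win // 2
    hist = np.zeros(bins)
    centers = (np.arange(bins) + 0.5) * d_max / bins
    cp = color[py, px].astype(float)
    for y in range(max(0, py - r), min(h, py + r + 1)):
        for x in range(max(0, px - r), min(w, px + r + 1)):
            if depth[y, x] <= 0:
                continue                                  # skip missing samples
            w_color = np.exp(-np.sum((color[y, x] - cp) ** 2) / (2 * sigma_c ** 2))
            w_space = np.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma_s ** 2))
            b = min(int(depth[y, x] / d_max * bins), bins - 1)
            hist[b] += w_color * w_space                  # accumulate the vote
    return centers[np.argmax(hist)]
```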
It turns out -- solving it, you know, with some analysis we can show -- that
joint bilateral upsampling, which is, you know, the bilateral-filter version of
combining with depth to upsample that was proposed before, is simply an L2
minimization of this function, whereas this one here becomes an L1 minimization,
and we found that it is much more robust and much more accurate.
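Following the speaker's framing, the contrast can be sketched as minimizing a
weighted L2 versus a weighted L1 cost over the candidate depth d, with W(p,q) the
joint color/spatial weight. The weighted-L2 minimizer is the weighted average
(which blurs across edges), while the robust weighted-L1 minimizer is the weighted
median, which, like the histogram-mode estimate, stays on one side of an edge. This
is an illustrative formulation, not the exact cost from the paper.

```latex
% Weighted-L2 cost: minimized by the weighted average of neighboring depths
% (the joint-bilateral-upsampling output, which blurs edges).
\hat{d}_{L_2}(p) = \arg\min_{d}\sum_{q\in\mathcal{N}(p)} W(p,q)\,\bigl(d - D(q)\bigr)^{2}
                 = \frac{\sum_{q} W(p,q)\,D(q)}{\sum_{q} W(p,q)} ,
% Robust weighted-L1 cost: minimized by the weighted median, which ignores
% outlier depths from the other side of an edge.
\qquad
\hat{d}_{L_1}(p) = \arg\min_{d}\sum_{q\in\mathcal{N}(p)} W(p,q)\,\bigl|\,d - D(q)\,\bigr| .
```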
So example results are shown here. Here is [inaudible], you know, holding
some image here; they have color, the depth is poor quality, and with nearest
neighbor interpolation you can see a lot of these artifacts. And joint
bilateral upsampling can do this -- when you zoom in, you see that, you
know, because of this L2 averaging, this L2 minimization, you
get blurring. L1, you know, as we know now, gives a very robust
estimation and it gives us sharp edges.
People have also proposed very expensive approaches using 3D, which
combine different [inaudible], but, you know, they are much more expensive than
what we have here and still cannot compete with this visual quality.
So that's what, you know -- but then one more thing we realized -- is there any
question? Okay.
So one more thing we realized is, well, we can do that well per frame, but then
when we play the video -- one thing we learned, which has now become another
major topic in my research group, is how to exploit the temporal dimension, right?
Because, remember, we're dealing not just with, you know, separate image
frames, we're dealing with videos now. And [inaudible] the depth camera
sensors now provide us video-rate information.
So the question of how to incorporate temporal dependency and consistency
has become a very interesting problem now.
So just to, you know, give a sense of a simple fix we do -- and I will show the video
shortly -- we look at a pixel here and, you know, let's say we assume
that we can reconstruct this one here. And now, you know, we look at the next
frame here. It might also have a reconstruction.
And we first, you know, use a very simple approach: find candidate
optical flow, you know -- a simple estimate of the optical flow -- and then, based on
patch similarity, which is a technique people now know from non-local means,
for example, very popular in de-noising and image recovery, we look at the
patch similarity here, give that a weight, and from there we can
infer, you know, what this pixel has to be. So a very simple technique, but
to us it works amazingly well.
And, you know, I admit this is not the final result yet, but at least it gives us
some, you know, insight about, you know, how to exploit this, you know, temporal
consistency.
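A rough sketch of that temporal refinement, assuming an optical-flow field is
already available from any simple estimator; the blend weight comes from a
non-local-means-style patch similarity, and all parameter values here are
illustrative rather than the ones used in the talk.

```python
import numpy as np

def temporal_refine(depth_t, depth_prev, color_t, color_prev, flow,
                    patch=5, sigma_p=20.0, alpha_max=0.7):
    """Blend each depth pixel with its motion-compensated previous estimate.

    flow maps pixel (y, x) at time t to its location in the previous frame
    (assumed given, shape HxWx2 as (dy, dx)). Similar color patches pull the
    depth toward the previous value; dissimilar ones leave it untouched.
    """
    h, w = depth_t.shape
    r = patch // 2
    out = depth_t.copy()
    for y in range(r, h - r):
        for x in range(r, w - r):
            yp = int(round(y + flow[y, x, 0]))
            xp = int(round(x + flow[y, x, 1]))
            if not (r <= yp < h - r and r <= xp < w - r):
                continue
            a = color_t[y - r:y + r + 1, x - r:x + r + 1].astype(float)
            b = color_prev[yp - r:yp + r + 1, xp - r:xp + r + 1].astype(float)
            sim = np.exp(-np.mean((a - b) ** 2) / (2 * sigma_p ** 2))
            w_prev = alpha_max * sim       # how much to trust the previous frame
            out[y, x] = (1 - w_prev) * depth_t[y, x] + w_prev * depth_prev[yp, xp]
    return out
```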
So let me show now the -- so this is the original video, taking the depth map from
the low resolution camera and simply doing, you know, nearest neighbor
interpolation. Okay?
And if we do it per frame, then -- per frame we can see that we have, you know, a
high quality image, but then -- you know, the temporal consistency here, you
see that, per frame? It still gives that -- let me replay that video.
So you can see those parts here, and, you know, I think viewed as a whole
video, you can see that there's a problem that comes up.
Now, if we just apply this, you know, very simple technique of enforcing
temporal consistency, then the edges are much more stable.
But then we also realized that we run into the problem of how to synchronize color
and depth, because the cameras that we set up are triggered over two
different, you know, USB connections, so when the person is moving, for example,
there's this [inaudible]. So that creates another issue.
Okay. So I hope -- you know, it's not yet a coherent story, but, you know, I think
we are really just exploring, and as we deal with the real data we realize that,
you know, there are more and more interesting open problems to deal with for
this particular type of data and sensing device.
So let me just conclude by saying that, you know, to me as someone working in
signal processing, it is really a paradise, because we have multiple sensors, we
have different modalities, you know, depth, color -- and so I collaborate with
my colleagues on the audio part, for example, detecting where a person is,
extracting the person out, and then we can [inaudible] audio at that location,
for example.
So this has multiple modalities. It's very exciting that we can exploit them.
And the [inaudible] -- so a lot of data, a lot of computation. Some of the
computation has to be done [inaudible], right, because of the different nodes,
different cameras. And coupled with communication: how can we compress the
data?
So I think it opens a lot of very interesting problems for signal processing. And
the applications are really exciting. So that really makes us excited about it.
This work has come out in a number of papers, listed here, if you want to look at
the details of some of the work I presented earlier on.
So I want to -- you know, as Cha mentioned earlier on, we are very excited about the
depth camera, you know, [inaudible] exercise. We play with them, we run them, but
then we also realized that, you know, it's also going to change the
way people deliver visual experiences and communication.
So two years ago, you know, my collaborators and I got together and [inaudible]
up a company, and in the next minute or two let me quickly
show you some of the results we developed through that -- you know, using the
depth camera, we want to provide a visual experience that is, you know, much
more engaging, with much more presence, for people, you know, either remotely
giving a presentation or doing remote collaboration.
So, you know, there's software that we developed that can be [inaudible]
with a Kinect camera. So let me just very briefly show that video and then I will
conclude with questions.
[Video played]
>> Narrator: In this tutorial you will learn how to get started using stage presence,
so let's go ahead and give it a try.
Make sure that your Kinect or Nuvixa camera --
>> Minh N. Do: So we developed software running on a Kinect now,
something that we would not have dreamed about two years ago -- you know, now
everyone can buy a camera, a Kinect. There are ten-plus million units out there.
So the person here is -- so that's a typical scene here, and we can, you know,
cut out the background. Of course, you know, the depth
quality from the Kinect has all these artifacts along the object
boundary, so we also know how to fix those artifacts. And just to show you an
example application: we can extract the person out -- like, you know, they can be
anywhere in the office -- and we can take the person out with a very nice, clean
cut.
And then when we have that, we can drop the person into a presentation or --
you know, if we think about it, a [inaudible] chart that I can walk through,
some of my desktop sharing, for example.
>>: [inaudible]
>> Minh N. Do: Yes. So that's the next thing, because we can now change the
eye gaze -- so, talking about that, we can correct the viewpoint so that the person
always looks directly -- yes, yes. So that's our next step.
And, of course, you know, we can leverage the depth to do a lot of
tracking. So the person can just stand up there, give a talk, and, you know, the
viewer will see the person be present and interact with the content. Yeah.
So, you know, what I'm really excited about is that this new sensing device gives
us not just, you know, the opportunity to research new problems, but
also the opportunity to develop, you know, real new applications. We released
this as a sneak peek on our website, and in less than a month we had more
than a thousand downloads. People try it, you know, on
the typical applications. We get [inaudible] people trying to use it for
presentations.
People also upload videos on YouTube. We have people trying to, you
know, develop software for people in rehabilitation -- like, they want to do their
exercises and see themselves, you know, riding bicycles or see
themselves, you know, in the mountains. So a lot of these very interesting, you
know, potential applications could be done with these depth-enabled devices.
Okay. So with that -- yeah, thank you very much for your attention.
[applause]
>> Minh N. Do: Any questions? Yes, please.
>>: Can you give some detail about how you synthesize the independent view?
So is it per pixel?
>> Minh N. Do: Yes. We synthesize per pixel, yes.
>>: So you talked about there's a surface somewhere.
>> Minh N. Do: Right.
>>: How do you get there and how do you get from there back to the image?
>> Minh N. Do: Oh, yeah. Sorry. So we assume that the scene, the surface, is
a [inaudible]. At this pixel we have color and depth. The depth tells us, you know,
how far that pixel is in the scene, so we go there, and with that depth
information, knowing where the other view is, we can do what we call propagating
that pixel. The color can now be copied over to the new view.
So it's that operation that allows us to -- with color alone, you know, we don't
have that information, and that's why stereo reconstruction is a very difficult
challenge. But now if I have that per-pixel information, then I can easily go to
the scene and I [inaudible] it over here.
But after [inaudible] that, then, you know, there's a lot of occlusion, a lot of
holes, and the subsequent steps try to correct for that.
Yes?
>>: So you do not try to zoom in or change the position so that you do get no
gaps between the [inaudible]?
>> Minh N. Do: We can zoom in, yeah. Of course, you cannot provide a lot
of, you know, zoom, because then, yeah, as you said, the density
of the new image is now much coarser. Then we have to do more interpolation
to --
>>: [inaudible]
>> Minh N. Do: Yes. So it's simple, just like a zoom. You see when we do
that video, it's not just simply, you know, turning sideways; we can zoom
in, zoom out. You see that fly-through. Of course, you know, if you zoom in too
much, like a digital zoom, then you start getting pixelated --
>>: You are trying to build a surface out of your depth points?
>> Minh N. Do: No, no, yeah. We want to avoid that, because, you know, surface
reconstruction is a very, you know, expensive operation, and it also
makes the problem -- you know, then it becomes like some kind of
computer-generated surface. Here, we want to capture the raw data
measurement with only very simple processing that can be done in realtime.
Yes?
>>: [inaudible]
>> Minh N. Do: Yes.
>>: [inaudible]
>> Minh N. Do: Yeah. Great. Yeah. It's really due to the mis-synchronization
between the color and depth. So we have these two cameras, and each of them,
you know, just gives us 30-frames-per-second video, and then we just take the --
you know, per second [inaudible], you know, there's a [inaudible] or a skew between
those two, and then we just pick, you know, one color and one depth frame.
So when the object stays still, then there's no problem. But when there's motion
here -- so for the depth, the hand is here but the color ends up over here -- then
we have that problem. We know that in the Kinect there's no hardware
synchronization, but for the camera [inaudible] we tried earlier on, it's very
simple: if the two cameras are on the same board, they can be run by the same
clock that triggers, you know, the capturing. Then we can have perfect
hardware synchronization, and we have less of that problem.
>>: Sorry. Now that I know that, can you play the temporally stabilized video
again?
>> Minh N. Do: Sure.
>>: Thank you.
>> Minh N. Do: Yes.
So let's first see the one without. So -- yeah, so, you know, really the artifact we
would like to remove is, you know, something like, you know, that [inaudible] coming
in and out; but, you know, it's when we have fast motion here, for example,
that we get in trouble.
Okay. So this one here is the nearest-neighbor interpolation, okay? You can see
it is very poor resolution. And the final one, which -- okay. How can
we -- the interpolation. So, yeah, if there's slow motion, then it is really good,
but when there's fast motion, that's where we have a lot of trouble. But, you
know, you see with this slow motion here -- oh, no, sorry, this is without.
Okay. And without, now you can see that it got even worse. So we correct it a
little bit, but, you know, we run into the limits of the hardware.
>>: [inaudible]
>> Minh N. Do: Yes. So that temporal consistency that we do, you know, which
we use, yeah, is one attempt. Maybe we have to -- thinking about this -- maybe we
have to use not just one previous frame but multiple frames, something
that people in coding know: you know, multiple reference frames can give better
prediction, for example. So, of course, it could improve the quality, but with some
complexity, you know, as the trade-off.
But certainly -- we were very happy when we saw a single image, but then
when we played the video we realized that, yes, this has now become a major issue.
A lot of algorithms, if you read the literature, [inaudible] work on a pair of
images, color and depth. So I think the depth information now -- the ability to
capture and record in realtime, to get this video -- I think it creates this
interesting new problem, you know, of how to enforce consistency across time for
the depth video.
>>: I have another question. When you warp the depths to the texture point of
view, there are certain areas where the depth does not have information, and you
mentioned using interpolation without having any underlying-surface assumptions
>> Minh N. Do: Right.
>>: You do propagation, but what -- what does the result eventually look like if you
have a gap in the occluded area? Is it more curved or is it
more like a [inaudible] type of thing, given your current [inaudible]?
>> Minh N. Do: Yeah. So the -- again, remember, there's a picture that we --
>>: [inaudible] that curvature continues
>> Minh N. Do: Oh, I see. I see. Yes. Yes.
>>: It could also be a plane. So what kind of result -- eventually if you get the
interpolation you know [inaudible] turn around to see what it looks like.
>> Minh N. Do: Yeah. Yeah. Yeah, I think that would be a big challenge. We
only -- you know, let's say, you know, for a small occluded area, if we don't have
color information, that is very challenging, but now we have color and we use a
little bit of the [inaudible] for the color, the sharp edges the color image provides,
and we can fill in. But, yeah, certainly we -- I think we try. I think when we have
a large occluded area, you know -- some artifacts show up.
And when you see the video, when, for example, in the fly-through the
viewpoint changes slightly, then we don't see that. But as it moves around more,
those artifacts start to pop up.
So I think we -- maybe we -- with that understanding, of course, the challenge is to
go ahead and fix those problems, you know, so that we can have more flexibility.
But we also realize that, you know, if we have small view corrections, then the
thing is [inaudible] quite robust, and maybe the application for that -- that's why I
showed it with the real data -- if the application is just really video communication,
like the typical camera here, and now we slightly change the eye gaze so that
the camera, the virtual camera, behind the screen looks directly into the eyes,
then that can be done, you know, effectively. Yeah.
So not a completely free viewpoint, but slightly changed -- I think that could be
doable, yeah.
Yes?
>>: I have a question about how you synthesize these novel views. When you're
rendering a new view, do you warp the depth and color information from a single
camera or do you warp from two or three nearby cameras?
>> Minh N. Do: Yeah. Great question. Yeah, we warp from multiple viewpoints,
yeah. And, you know, there are already a lot of techniques in the literature
for when you warp multiple viewpoints and how you resolve the conflicts,
you know.
So, again, there's some information from the depth that can show which
pixels to use and which to discard after they are warped to the virtual
viewpoint, and then there are techniques for how to fill in some of the, you know,
missing pixels, some of the holes.
The reason we use more than one view is exactly the question of what
was missing. If there's an object that -- with a single viewpoint there is
occlusion, then, you know, we cannot recover it, but now there's another view, and
hopefully that occluded area is visible from the other viewpoint, so it fills in.
Where a point is seen by both, you know, we use some kind of
robust, you know, interpolation so it doesn't smear out. Yeah. But for that
technique there's already well-established literature. Actually there's a
standard [inaudible] on how to do this view synthesis given a common
depth, from either a single viewpoint or multiple viewpoints.
Yes.
>>: [inaudible]
>> Minh N. Do: Right.
>>: [inaudible]
>> Minh N. Do: Right, right, right.
>>: [inaudible]
>> Minh N. Do: Right, right.
>>: [inaudible]
>> Minh N. Do: Yes, yes.
>>: [inaudible]
>> Minh N. Do: Right, right.
>>: [inaudible]
>> Minh N. Do: Yes.
>>: [inaudible]
>> Minh N. Do: Great question. Yes. We didn't actually test with the Kinect,
but, you know, we -- that's why we did the other problem that I showed last,
which tries to tackle exactly the problem you mentioned.
The depth is poor quality, you know, so when you encode it, you encode a lot
of that noise and you pay dearly for that. So now with the [inaudible] -- with the
Kinect, the Kinect doesn't just have depth, there's a color camera next to it, and
we can use the color information to fix those, you know, holes or occluded
areas. And then we have a pair of, you know, post-processed -- I'm sorry,
pre-processed images. So, you know, the proposal we have is: take the raw, let's
say a pair of color and depth from the Kinect, use the color image to fix and fill
in those pixels and get a clean edge map, and then use that processed image for
encoding.
Then we can get to about 70 percent of the bit rate as well as, you know, enhanced
image quality.
So, yeah, it's not meant to just throw the raw data [inaudible] into the encoder.
That would not work.
>>: I'm curious just to see [inaudible]
>> Minh N. Do: Right.
>>: [inaudible]
>> Minh N. Do: So, for example, in that application here, what we show is -- let's
say one of the objects -- you can see the way you can cut out the object --
>>: [inaudible]
>> Minh N. Do: Yes.
>>: [inaudible]
>> Minh N. Do: Yes. You can see that we cut out the object boundary very accurately
now, and we do that with a Kinect camera. Why? Because of the depth. But you
know, if you just use the depth for that cutout, it is very noisy. But we
have the color, so we have a technique that runs in realtime and cuts it out very
nicely.
And we have that now. That information is now a valuable pair of depth and color,
and then we can use that for further encoding.
>>: [inaudible]
>> Minh N. Do: Right, right, yes.
>>: [inaudible]
>> Minh N. Do: [inaudible].
>>: You're going to have lots of holes everywhere. That's going to [inaudible]
>> Minh N. Do: So -- so Matt here can elaborate more, but the key is we don't
have to find all those edges. We only find some of the key edges -- finding all of
them is expensive -- and then we just use that information to guide the encoder. The
more we have, the better, but, you know, even a few of them already
help to reduce the bit rate.
Now, for the rest we have to, you know, spend more bits on coding the residual. So,
you know, like in video coding, if you don't do a good job in motion estimation, the
residual has to catch up or, you know, you have poor quality; but, yeah, it is not
going to fail miserably, it just degrades gracefully.
>>: So from a coding point of view, you want to duplicate the data on the other
side of this [inaudible]
>> Minh N. Do: Yes.
>>: Seems you have already applied some processing to the image. So you save
some bits but then [inaudible]
>> Minh N. Do: Right, right.
>>: [inaudible]
>> Minh N. Do: Great point. I'll explain. We realized that when you up-sample,
remember, the whole thing is still a low-order polynomial, and even
though, you know, you have a larger signal, if it's piecewise
[inaudible] and we know where the boundaries are, the wavelet will just eat it up
very easily, you know. So in wavelets what happens is the transform takes you,
you know, back to a very low resolution, even much lower than
the original, and then the rest remains in the high-frequency or high-subband
coefficients [inaudible]. So, yeah, it is [inaudible] you have cleaned it up, but
when you encode it, [inaudible] takes care of that.
So the bit stream, you know, in the end, is very small. People have actually
learned that, you know, you take an image, you clean it up, and it turns
out the cleaned-up image is much more efficient to encode than the original.
Yeah.
>> Cha Zhang: Let's thank the speaker.
>> Minh N. Do: Thanks.
[applause]