>> Eyal Ofek: Hi. It's my pleasure to invite Gaurav Chaurasia for a
talk here. Gaurav is finishing his Ph.D. under the supervision of
George Drettakis at INRIA, in the area of image-based rendering. Gaurav.
>> Gaurav Chaurasia: Thanks a lot for the introduction and a very good
afternoon to everybody. I will be talking about my work on image-based
rendering.
So let's start from the basics. The basic pipeline in all of graphics
is: first you do scene modeling, where you create a model of the scene;
then you edit the materials, the BRDFs, and design the illumination of
the scene; and then you render the final result.
But all of these steps require the intervention of trained artists,
depending on how good an image you want to get. That brings
image-based rendering into the picture, because anybody can take images,
and if you could render the scene just from images, that would be
really cool.
The problem with image-based rendering is that you can create exact novel
views from input images only if you reconstruct the whole scene exactly
from your input images. That turns out to be a very hard problem;
reconstructing the geometry, materials and lighting of the scene is still
active research, so you can only create approximate novel views.
So the challenge with IBR is basically: how do we create an
approximate novel view, and what is an approximate novel view?
I'll try to answer these questions in my thesis. My work is
mostly focused on the first part, which is how to create approximate
novel views, and I was lucky to be part of a project and a team
which worked on perceptual analysis of image-based rendering.
So let me start by explaining the problem statement in more detail. We
are trying to do image-based rendering of urban scenarios. We're
looking at datasets, at images, which have a lot of architecture, trees,
cars, complex geometry.
We're trying to use only images captured with hand-held cameras.
We don't want to depend upon laser scans. If we had that data,
it would be great, but we don't want to rely on dense data.
We would like to do this with as few images as possible. The example you
see here is a dense capture, lots of images captured of
the scene. We'd like to do this with sparser and sparser captures,
because that allows us to do very large scale image-based rendering of
really huge datasets. And lastly we'll try to approach the problem of
free viewpoint navigation. Most existing work does view interpolation, where you
jump from one input view to other input views, but doing free viewpoint
navigation is a much harder problem and we'd like to start
working in this direction.
A brief overview: image-based rendering started with light fields and
image-based modeling way back in '96. There's been a lot of work, as I
said, on view interpolation.
This field has spawned a number of other applications, like upsampling
of videos and camera stabilization. As I said, all of my work is
focused on IBR of urban scenes. And we have a lot of commercial
products in this field already. All of these products use a very
basic form of image-based rendering: you have one plane, you jump
from image one to image two, do basic blending, and approach this
problem as well as possible.
A standard approach is to reconstruct the geometry of the scene and
then any pixel in the target view can be recreated by back projecting
on the geometry and reprojecting into the input views. And the
earliest work was indeed based on this approach, and this actually
works really well if your geometry is good.
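To make this back-project/reproject step concrete, here is a minimal sketch in Python, assuming simple pinhole cameras with intrinsics K and pose (R, t). The function names and all numbers are illustrative, not taken from the talk or the actual system.

```python
import numpy as np

# Minimal sketch of the classic IBR reprojection step described above,
# assuming simple pinhole cameras (K, R, t) and a known depth for the
# target pixel. All names and values here are illustrative.

def backproject(pixel, depth, K, R, t):
    """Lift a pixel (u, v) with known depth into world coordinates."""
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    ray_cam = np.linalg.inv(K) @ uv1          # ray in camera coordinates
    point_cam = depth * ray_cam               # 3D point in the camera frame
    return R.T @ (point_cam - t)              # transform to the world frame

def project(point_world, K, R, t):
    """Project a 3D world point into an input view; returns (u, v)."""
    point_cam = R @ point_world + t
    uvw = K @ point_cam
    return uvw[:2] / uvw[2]

# Usage: a target-view pixel is back-projected onto the proxy geometry
# (represented here by a single depth value) and re-projected into an
# input camera, whose color would then be fetched and blended.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R_tgt, t_tgt = np.eye(3), np.zeros(3)
R_in, t_in = np.eye(3), np.array([0.2, 0.0, 0.0])   # input camera offset

world_pt = backproject((320, 240), depth=5.0, K=K, R=R_tgt, t=t_tgt)
print(project(world_pt, K, R_in, t_in))
```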
As it turns out, the geometry of urban scenes, and for that matter any
dataset with enough complexity, is not completely accurate. And
that's why most of the recent work has focused on alleviating the
artifacts caused by slightly inaccurate geometry.
For example, in this work optical flow is used to fix ghosting
artifacts: you project the input images onto the proxy, and if there is an
alignment issue, optical flow can realign them. This works well if
your alignment is off by a few pixels. More recently, in this work
done partly at MSR, ambient point clouds are used for unreconstructed
areas of the image. The main object of interest is very well
reconstructed, as you can see here, and the background can fade in
and out.
We rely on multi-view stereo, and we know these approaches can give you
sparse and noisy data if you don't have enough texture in your input
images, if the texture is too random, too stochastic, or if the geometry
is too complex.
And all of these issues are aggravated if your input images are wide
baseline, when the spacing between input images is large and the overlap
is quite small.
So from the point of view of IBR: this is an input image and a depth map
which we got from existing multi-view stereo techniques. The problem is the
depth can be quite sparse in some regions; for example, the tree is almost not
there, there are very few depth samples, and occlusion boundaries are simply not
captured by the depth maps.
With this we now know that geometry estimation does not give us
dense data everywhere, and for such areas we will try to use
image-based approximations. We'll try to use image warping, and the two
main issues are of course occlusion boundaries and the depth samples which
we're missing.
Why image warping? Because it has been shown to work
well with a small number of points: if you have a few pixels which can
be reprojected into the novel view, you can warp your whole image with them
in a least-squares sense.
All of this was inspired by the camera stabilization paper of Liu et
al., and we improved that to use a shape-preserving warp, as I'll
explain afterwards. And it's a more intuitive way of controlling how
the input image is converted to a novel view. You can use depth to do
that, but the effect of depth is more indirect: you can't anticipate
the artifact you'll get because of noisy depth. But if you design your
algorithm on the basis of image warps, you have a more intuitive way of
controlling how the image will be projected into the target view.
So I'll first go over my first project briefly. This project was
at EGSR a couple of years back, and it really is a proof of concept
that image warping is indeed useful. So we start with a bunch of input
images. We reconstruct the depth maps for each of the input images
using standard off-the-shelf multi-view stereo approaches, and we ask the
user to mark the occlusion boundaries approximately; I'll explain why we
do this afterwards.
With this data we have depth samples, and we're going to use the
depth samples as constraints for image warping. Each pixel shown
here in gray can be projected into the target view. All the white
areas cannot be projected because they have no depth.
At this moment, if you overlay a warp grid on top of this and
use each of the gray pixels to reproject the whole mesh, then you can
plausibly handle the empty regions, and that's what the shape-preserving
warp is all about. We have the reprojection constraint, which comes
from the depth samples: we overlay a big grid mesh on top
of the 2D image, some pixels, as shown here, have depth, we can
reproject these into any target view, and we can warp the whole mesh
along with them. To hold the whole mesh together we use some kind
of 2D constraint, the shape-preserving
constraint, which enforces that each triangle of the mesh is only
allowed to undergo a translation, an in-plane rotation and an isotropic
scale. The end effect is that each triangle does not undergo arbitrary
deformations, and that preserves the local shape of the warped mesh.
But this, as you would notice, is a smooth image warp, and that
will be a problem at occlusion boundaries, because the background will
move at a different speed compared to the foreground. This indeed is a
problem, as you can see in this admittedly rough illustration. If you have
some points in the foreground in green and some in the background in red
and you move the novel camera in one direction, the points will try to
fold over each other, which is the case of occlusion, when the
foreground occludes the background. The same thing happens if you
move the camera in the other direction:
in that case, the warp mesh will stretch, because the background will
move in a different direction than the foreground, and this will lead
to artifacts as shown here. This is an input image; moving the camera
in opposite directions leads to these kinds of egregious distortions.
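To make the warp formulation concrete, here is a small hedged sketch of a shape-preserving mesh warp solved in a least-squares sense, in the spirit of what is described above (a similarity-transform constraint per triangle plus a few reprojection anchors). The grid size, weights, anchor targets and the exact form of the energy are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

# Toy shape-preserving mesh warp: a few anchor vertices are pulled to
# "reprojected" target positions, and every triangle is softly constrained
# to undergo only a similarity transform (translation, rotation, isotropic
# scale). Grid size, weights and anchors are made-up illustrative values.

W, H = 6, 6
rest = np.array([[x, y] for y in range(H) for x in range(W)], dtype=float)
n = len(rest)

def triangles(w, h):
    tris = []
    for y in range(h - 1):
        for x in range(w - 1):
            i = y * w + x
            tris += [(i, i + 1, i + w), (i + 1, i + w + 1, i + w)]
    return tris

rows, rhs = [], []

def add_eq(coeffs, target, weight=1.0):
    """coeffs: list of (variable_index, coefficient); variables are
    interleaved as [x0, y0, x1, y1, ...]."""
    row = np.zeros(2 * n)
    for var, c in coeffs:
        row[var] += c
    rows.append(weight * row)
    rhs.append(weight * target)

# 1) Shape-preserving constraints: for each triangle (v0, v1, v2), express
#    v2 in the local frame of edge (v0, v1) at rest, and require the same
#    combination to hold after the warp (a similarity constraint).
for (i0, i1, i2) in triangles(W, H):
    p0, p1, p2 = rest[i0], rest[i1], rest[i2]
    e = p1 - p0
    perp = np.array([-e[1], e[0]])
    a, b = np.linalg.solve(np.column_stack([e, perp]), p2 - p0)
    # x-component: x2 - x0 - a*(x1 - x0) + b*(y1 - y0) = 0
    add_eq([(2*i2, 1), (2*i0, a - 1), (2*i1, -a), (2*i1+1, b), (2*i0+1, -b)], 0.0)
    # y-component: y2 - y0 - a*(y1 - y0) - b*(x1 - x0) = 0
    add_eq([(2*i2+1, 1), (2*i0+1, a - 1), (2*i1+1, -a), (2*i1, -b), (2*i0, b)], 0.0)

# 2) Reprojection constraints: anchor a few vertices (pretend these are
#    pixels with depth, reprojected into the target view).
anchors = {0: (0.5, 0.2), W - 1: (W - 1.3, 0.0), n - 1: (W - 0.8, H - 1.0)}
for idx, (tx, ty) in anchors.items():
    add_eq([(2 * idx, 1.0)], tx, weight=10.0)
    add_eq([(2 * idx + 1, 1.0)], ty, weight=10.0)

A = np.vstack(rows)
x, *_ = np.linalg.lstsq(A, np.array(rhs), rcond=None)
warped = x.reshape(-1, 2)   # warped vertex positions of the grid mesh
print(warped[:3])
```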
To handle this we embed a discontinuity in the warp mesh. This is done
in the form of an elastic band: we mark the occlusion boundaries approximately
by hand, so we have these very nice edges around each foreground object, and
we insert a narrow band of triangles around the foreground objects.
Inside these triangles we do not apply the shape-preserving constraint.
This makes sure that this area is allowed to deform arbitrarily, but the
rest of the mesh does have the shape-preserving constraint and behaves
the way it should. The end effect is that the part above this
elastic band shown here does not affect the part below, and these two
parts can move in opposite directions or they can even fold over.
This constraint was the main contribution of the paper. It showed us
that you can use smooth image warps and still get a
discontinuous sort of effect. At the same time, you want to maintain the
shape of the foreground object. Imagine an object which has a right
angle, a horizontal and a vertical edge: you want to make sure that this
right angle stays the same while you're warping your image to the novel
view. To enforce that, we constrain the angle between any two
neighboring edges of the foreground object's boundary. With these four
constraints we get the full image warp energy, and we solve it in a
least-squares sense. The effect of all of this is that the earlier
distortions which you saw are now contained inside the elastic band
which we marked earlier. So the red region shown here was in the
beginning just one pixel wide, but warping the image to different places
allows all the distortion to be absorbed in that narrow band, and since we
know where the distortion is, we have other images to fill content into these
areas. Now I'll go over the rendering pipeline. I hope you can see
the input cameras. Okay, so the input cameras are all shown in green;
for each target view we pick a few input images, warp them
and blend them. I won't go into the details of blending. So here are
some results. On that side you'll see results of an earlier
approach, unstructured lumigraph rendering with the best quality model
we could obtain, and you will notice that we get ghosting artifacts
there because the model was not perfect. And here we don't have the
same ghosting artifacts.
The results aren't perfect, but we manage to improve significantly on
occlusion boundaries. In this case you'll see the whole tree moves
more or less as one block and is not broken up completely. All the novel
camera paths here are not quite view interpolation, but they're very
close; they're not really free viewpoint yet. The reason we could not do
completely free viewpoint rendering is that we used
a global image warp, and this led to distortions when you move too far
in or too far out. I'll show some comparisons of these distortions
afterwards.
The major limitations were, first, that it was a global image warp,
which led to distortions that I'll explain later, and second, that
we had to ask the user to mark all the edges manually, and this led
to a problem.
I'm sorry?
>>: In addition to marking just manually do you also have to somehow
specify the Z order so the tree moves in front of the building or is
that automatic.
>> Gaurav Chaurasia: No, that's automatic, because you do have a
reconstruction, so you can use that to infer the ordering of things.
And as you can see, this would be a painful process: having to do
the same thing for every single image of every single dataset makes
this approach not very practical.
And being a global image warp, it was real time, but the system was
quite big: if you're trying to solve a warp mesh for the whole image,
the system becomes quite large, mostly because we had to embed these elastic
bands. It was a conformal Delaunay triangulation, and that made it
harder; it had numerical issues and other problems. And lastly, as I
said, we had distortions.
So to address these issues, we did the paper which I'll present at
SIGGRAPH in a few days. The main targets were to be
completely automatic and not depend on manual intervention, to be real time,
and to make it easy.
For this we'll again use image-based approximations. To get
occlusion boundaries we use image over-segmentation: divide the image
into lots of superpixels, and these superpixels always capture all
the occlusion boundaries. They contain lots of other edges too, but we can
handle that, because we're using a shape-preserving warp; I'll explain that
in a bit. To handle areas with no depth, we add some fake,
approximate depth in the empty areas. It's important to note that the depth
we add is not meant to be consistent; it's only useful for doing these
approximate image warps. You cannot use this depth to help you in
surface reconstruction.
The third step is largely the same image warp, but this time it will be
a local warp applied on each superpixel individually, and this reduces the
size of the warp and makes it much easier to actually solve. It
becomes much, much more lightweight. Last is adaptive blending, where
we try to alleviate the artifacts which arise because of ghosting and
popping.
The preprocessing is largely the same: we again use multi-view stereo to
reconstruct a depth map for each input image. The second step has
changed: instead of asking the user to mark occlusion boundaries,
we use image over-segmentation, and from this point onward we use
superpixels as the basic building block.
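As a rough illustration of this over-segmentation step, the sketch below uses SLIC from scikit-image as a stand-in segmenter (the talk does not name the segmentation algorithm actually used) and then checks which superpixels contain no depth samples. The image and the sparse depth map are placeholders.

```python
import numpy as np
from skimage import data, segmentation

# Hedged sketch: over-segment an image into superpixels (SLIC as a
# stand-in), then find the superpixels that have no depth samples at all.
# The "depth map" is synthetic and only ~5% of pixels carry depth.

image = data.astronaut()                       # placeholder RGB image
labels = segmentation.slic(image, n_segments=400, compactness=10,
                           start_label=0)

rng = np.random.default_rng(0)
depth = np.where(rng.random(labels.shape) < 0.05,
                 rng.uniform(2.0, 10.0, labels.shape), 0.0)

empty = []
for sp in np.unique(labels):
    mask = labels == sp
    if not np.any(depth[mask] > 0):            # superpixel with no depth
        empty.append(sp)

print(f"{len(empty)} of {labels.max() + 1} superpixels have no depth samples")
```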
The next step is to add depth into empty regions. Here's a depth map
and a superpixel segmentation. If you overlay these two things, I hope
you can see it over there, you have lots of superpixels which have lots
of depth; that's all fine. There are also superpixels
that have no depth, which appear in white; we cannot reproject
them in any way into a novel view if we warp superpixels individually.
To handle such cases we can borrow depth from other superpixels which
have the same image content and are spatially close. So let's mark all
of these problematic areas in green and look at any one of
them, for example. We have these two ideas to find approximate
depth: spatial proximity and same visual content. First, let's try to
find the superpixels that have the same visual content. These could be
anywhere. If we rank superpixels by their similarity in visual
content, we found that the rank one, two, three, four, five candidates were
often in very different parts of the image.
So you cannot completely rely on this; we had to incorporate spatial
proximity. The problem is that if you put spatial proximity
across the image plane and RGB difference into one weighted distance
metric, you'll have to adjust the weights every time, for each image
and for each dataset. So to solve this we use a superpixel graph. We
have superpixels like this; we consider each superpixel as a node in the
graph, and the edge weight between two superpixels is the distance between
their RGB histograms. What this basically means is that
an edge between two superpixels on the wall will be quite small,
because they have the same visual content, and an edge between two superpixels
on the tree will also be low, but anything between the wall and the tree will
be quite high. Histograms give you a nice way of evaluating this
metric, because the shapes and sizes of superpixels are quite arbitrary,
but their content is supposed to be very homogeneous.
So with this, you now have a graph of your whole image, and you're
trying to find the best neighbors of the target superpixel, shown
in red over here. You want to choose a few of
these yellow guys, to decide where to take depth from.
Let's zoom into this area
and try to find a path from one of the red guys to one of the yellow
guys; the path can go anywhere in the image. And you can imagine
the shortest path will be something that jumps over superpixels which
always have the same visual content. If you stay within the tree,
that path cost will be small, but if you ever jump outside the
tree, you'll immediately incur a big edge weight.
So we rank all of these yellow guys by their shortest path cost: we
can find the shortest path for all the yellow guys, and we retain those
which have the smallest cost. These have the same visual content,
because that's what we started with, and they're also close in the
image; they're more or less on the same scene object.
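Here is a small hedged sketch of that superpixel-graph idea: nodes are superpixels, edge weights are RGB-histogram distances between adjacent superpixels, and the best depth donors are the depth-carrying superpixels with the smallest shortest-path cost from the target. The toy graph, histograms and the chi-squared-style distance are illustrative assumptions, not the exact choices in the paper.

```python
import heapq
import numpy as np

# Superpixel graph sketch: histogram-distance edge weights + Dijkstra.
# The 6 "superpixels", their 4-bin histograms and the adjacency are made up.

def hist_distance(h1, h2):
    """Chi-squared-style distance between two normalized histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-8))

hists = {
    0: [0.70, 0.10, 0.10, 0.10],   # tree (target, no depth)
    1: [0.65, 0.15, 0.10, 0.10],   # tree, has depth
    2: [0.10, 0.70, 0.10, 0.10],   # wall
    3: [0.10, 0.68, 0.12, 0.10],   # wall, has depth
    4: [0.68, 0.12, 0.10, 0.10],   # tree, has depth
    5: [0.10, 0.10, 0.70, 0.10],   # sky
}
adjacency = [(0, 1), (0, 2), (1, 4), (2, 3), (2, 5), (3, 5)]
has_depth = {1, 3, 4}

graph = {i: [] for i in hists}
for a, b in adjacency:
    w = hist_distance(hists[a], hists[b])
    graph[a].append((b, w))
    graph[b].append((a, w))

def dijkstra(source):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, np.inf):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, np.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

dist = dijkstra(source=0)
# Keep the cheapest depth-carrying superpixels as donors for superpixel 0.
candidates = sorted(has_depth, key=lambda sp: dist[sp])[:2]
print("best depth donors for superpixel 0:", candidates)
```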
Now, it's important to understand that the reason we are choosing these
three is not that they're close in the Euclidean sense; they're close on
this graph. This algorithm would converge to the Euclidean sense if you had
depth in the neighboring superpixels, but if you did not, it could go and
find depth from a relatively far-off place, without trying to give the red
guy depth from the wall, which is actually much closer. So at
this point we have a superpixel where we wanted to add some
depth, and we found some candidates for it which have depth samples,
and we can interpolate these depth samples to add depth
into the target superpixel. I won't go into the
details of this interpolation.
And so we add a few depth samples. You could go on and add depth to
every pixel of these superpixels, but that's not required, because we're
about to use a shape-preserving warp, and I'll explain why that helps.
We started with something like this, with sparse depth, and we added some
of these samples artificially. Now we have something in every
superpixel and we're good to go to the next step, which is image warping.
Now that you have depth samples in every superpixel, you could
triangulate and project each of the vertices to get a target view.
But this will lead to problems, because we added these approximate samples,
and even the original samples can have a bit of
noise, a bit of inaccuracy. The effect of this tiny bit of noise
can be seen on the balcony part, which was well reconstructed, but whose
depth samples still carry a little bit of numerical error. And in the
top part you'll see problems because we added approximate depth samples
on the tree and the bush.
The result of our shape-preserving warp is that it does not leave as
many cracks, simply because we regularize the effect of the depth samples;
it's a least-squares smoothing, and as I said before, the
shape-preserving warp gives us a straightforward, intuitive way of
controlling the effect of these depth samples. So let's jump into the
warp itself.
As I said, these are the areas where we actually added the depth
samples in the previous step, and that's where we can see the best
effect of the shape-preserving warp. The warp is essentially the same,
with the same reprojection constraints, but now the depth samples within each
superpixel are projected into the target view and you warp the superpixel
along with them. The shape-preserving term is also the same: we try to make
sure that each face of the warp mesh does not deform too much.
With this we can now choose four input images to warp and blend for
each target view, which you can see on one side, with the final blended
result on the other side.
Now let's take a brief look at the adaptive blending step.
The basic weights for blending images in image-based rendering come
from unstructured lumigraph rendering, a very old approach which still holds
good; these weights are based on the orientation of the contributing cameras.
The problem is that they blend too aggressively, and in some situations you
always get ghosting artifacts if you blend too much. This is
something that we found in another work which I'll describe in a little bit.
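For reference, here is a simplified sketch of an unstructured-lumigraph-style angular weighting. It keeps only the angle penalty (the full ULR heuristic also considers resolution and field-of-view terms), and all names and numbers are illustrative.

```python
import numpy as np

# Simplified ULR-style blending weights: for a 3D point seen from the
# novel camera, each input camera is penalized by the angle between its
# viewing ray and the novel ray; the k best cameras get normalized weights.

def angular_weights(point, novel_center, input_centers, k=4, eps=1e-6):
    ray_novel = point - novel_center
    ray_novel /= np.linalg.norm(ray_novel)
    angles = []
    for c in input_centers:
        ray = point - c
        ray /= np.linalg.norm(ray)
        angles.append(np.arccos(np.clip(np.dot(ray, ray_novel), -1.0, 1.0)))
    angles = np.array(angles)
    order = np.argsort(angles)[:k]              # k cameras closest in angle
    w = np.zeros(len(input_centers))
    w[order] = 1.0 / (angles[order] + eps)      # smaller angle -> larger weight
    return w / w.sum()

point = np.array([0.0, 0.0, 5.0])
novel = np.array([0.1, 0.0, 0.0])
inputs = [np.array([x, 0.0, 0.0]) for x in (-1.0, -0.3, 0.3, 1.0, 2.0)]
print(angular_weights(point, novel, inputs))
```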
So to handle this notion of adaptive blending, which
means we want to blend only when absolutely necessary, we use the idea
of superpixel neighborhoods across images. Let's assume these are
zoomed-in areas of two images, with two superpixels just marked
over there. These two superpixels will be considered neighbors
if the depth samples of one reproject into the other. This means that
there is some sort of correspondence between these two superpixels:
they belong to the same part of the same scene object. If that is the
case, it's probably okay to blend them. If at any target pixel we are
trying to blend two superpixels which are not neighbors, and hence do
not belong to the same region of the same scene object,
it's not a good idea to blend them, because you will
end up with ghosting artifacts. In such cases we favor one candidate
quite heavily: we increase its blending weight artificially, and that
reduces the amount of ghosting quite a lot. And lastly, for any small
cracks left, we just use basic hole filling to get the final result.
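The following sketch illustrates that adaptive blending test under stated assumptions: a hypothetical are_neighbors check that reprojects one superpixel's depth samples into the other superpixel's mask, and a blend_weights helper that boosts the background candidate when the two superpixels do not correspond. It is not the paper's actual code; the reprojection is abstracted into a callback and all numbers are illustrative.

```python
import numpy as np

# Adaptive blending sketch: blend normally only when two superpixels from
# different input images correspond; otherwise strongly favor one of them
# (the farther, background one) to avoid ghosting.

def are_neighbors(samples_a, mask_b, reproject_a_to_b, tolerance=0.3):
    """samples_a: list of (u, v, depth) in image A; mask_b: boolean HxW mask
    of the superpixel in image B; reproject_a_to_b maps (u, v, depth) in A
    to (u, v) in B."""
    hits = 0
    for (u, v, d) in samples_a:
        ub, vb = reproject_a_to_b(u, v, d)
        if 0 <= int(vb) < mask_b.shape[0] and 0 <= int(ub) < mask_b.shape[1]:
            hits += mask_b[int(vb), int(ub)]
    return hits >= tolerance * len(samples_a)

def blend_weights(w_a, w_b, neighbors, depth_a=5.0, depth_b=8.0, boost=10.0):
    """Normalized blend if the superpixels correspond; otherwise boost the
    background candidate so the other barely contributes."""
    if not neighbors:
        if depth_b > depth_a:
            w_b *= boost   # parallax error is less visible in the background
        else:
            w_a *= boost
    total = w_a + w_b
    return w_a / total, w_b / total

# Toy usage with an identity "reprojection" and a small mask.
mask = np.zeros((10, 10), dtype=bool)
mask[2:6, 2:6] = True
samples = [(3, 3, 5.0), (4, 4, 5.0), (8, 8, 5.0)]
print(are_neighbors(samples, mask, lambda u, v, d: (u, v)))
print(blend_weights(0.6, 0.4, neighbors=False))
```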
Now I'll show some free viewpoint results. In these cases you can see
that we're actually moving the novel camera, shown in red, quite far from
the input images, and in this case really far from the input images;
this is taken from across the road, on the sidewalk.
The images which are warped are shown in blue, and you can see
that we do get parallax effects even though we move quite far from the
input cameras.
That was the main goal of this project. We experimented with our
approach on a number of input scenes; you can find more details in the
videos and the paper, and I would encourage you to look at them. Now we
compare to existing approaches. This is a comparison with floating textures,
which uses optical flow to alleviate ghosting artifacts. In this case you
can see it was not designed for cases where the artifacts are really
big, so it does not converge as well as it should. In this one
we compare with ambient point clouds, which uses a hazy point cloud
for unreconstructed areas. Again, this was not designed for cases
where your unreconstructed areas are actually on the main scene object,
and then it looks like an artifact. It's also important to note
that it is a view interpolation technique and doesn't handle free viewpoint
navigation. Lastly, we compared to our own previous approach, the one I
explained before, and as I said, we get distortions when we move very far
away from the input cameras, as you can see in these cases. It's also
important to note that the new approach is completely automatic, while
the old one required a lot of manual intervention, aside from being a
bit slower and heavier for the machine.
The main limitation of our approach is that we do not handle sharp
depth gradients within superpixels; we assume that each superpixel
does not have a big depth gradient. If you're looking at some surface
of the scene at a grazing angle, then any superpixel on that surface
will have a sharp depth gradient, and if you try to warp it, it will warp
more in one direction than in the other. The isotropic scale of
the shape-preserving warp breaks down, and that's when we get artifacts.
We'd like to solve this issue by redesigning the shape-preserving warp
itself, and that's not completely inconceivable.
So with this I'm done with one section, which was the main work, the main
two projects of the Ph.D. I'll now move to two other projects
where I was a co-author. These were about understanding perceptual issues
in image-based rendering. The first paper was about the age-old
question of blending versus popping. You'll notice that in both
of these papers we use a very simple IBR scenario: we use simple scenes
which have just a facade and a balcony in front, the proxy for the
scene is just one big plane, as it is in applications like Street
View, and lastly the IBR approach used is probably the oldest one
ever, just projective texture mapping. The reason we used this very
simplified setting was to make sure we have complete control over the
studies, with no other variables. Obviously this has to be
improved and extended to arbitrary scenes, but we had to start
somewhere.
So in this paper we try to understand this issue of blending versus
popping. That's basically what Street View does: you have panoramas
captured at discrete places, and novel views are constructed by choosing
contributions from the two images; the scene geometry could be
anything, but the proxy is always a flat plane.
In this study we showed people image-based rendering results and a
reference video along the same path, and they were asked to rate the
level of artifacts in these videos. So, for example, in this case,
there's no blending, so you'll notice that the edge of the balcony
really pops, and the popping covers a big distance because it's a sparse
capture: there are few input images, so it jumps over large distances.
In the second one, the popping is smaller but more frequent, because you
have more input images. On the other side, in terms of blending, we again
have a sparse capture where you can notice lots of hazy effects,
especially on the banners over there and on this door. And part
of the window, if you can see over there, has this big ghost, which is
really in two different places, simply because of the sparse capture.
A dense capture brings the ghosts really close and merges
them, as you can see in this case. So we showed all of these one by one
to the users, along with the reference view, and asked them to rate all of
this. We got a strong guideline that people like seeing a sharp
image which pops a little bit, compared to a ghosted image which
always has low frequency, which is missing high-frequency details. For
example, in this case you are now using image one, and then it pops to
image two at some point and then continues with image two.
People like this more than something like this, which is a little bit
hazy: you don't see the big artifact, the big jump, but it's always
a little bit blurry. In the next experiment, we tried to understand
what could be a compromise between blending and popping. We use what
is known as a cross-fade, where you use image one up to some point, then
fade between image one and two, and then you use image two. This
turns out to be a good compromise between the other two cases;
popping and blending are extreme forms of cross-fade: a pop is a
cross-fade of length zero, and a blend is a cross-fade over the complete
path. We again showed all of these with the reference views, and we
got a strong guideline that cross-fades over long durations are less
acceptable than cross-fades over shorter durations.
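A tiny sketch of that cross-fade schedule, assuming a normalized path parameter; popping and full-path blending fall out as the two extreme fade lengths. The switch point and fade length are made-up values.

```python
# Cross-fade weight along a normalized path parameter t in [0, 1]:
# image 1 is shown, then a linear fade of width `fade_length` centered at
# the switch point, then image 2. A pop is fade_length = 0; blending over
# the whole path is fade_length = 1.

def crossfade_weight(t, switch=0.5, fade_length=0.2):
    """Return the blending weight of image 2 at path parameter t."""
    if fade_length <= 0.0:                       # pure popping
        return 0.0 if t < switch else 1.0
    start = switch - fade_length / 2.0
    end = switch + fade_length / 2.0
    if t <= start:
        return 0.0
    if t >= end:
        return 1.0
    return (t - start) / (end - start)           # linear ramp inside the fade

for t in (0.0, 0.35, 0.45, 0.5, 0.55, 0.65, 1.0):
    w2 = crossfade_weight(t, fade_length=0.2)
    print(f"t={t:.2f}  image1={1 - w2:.2f}  image2={w2:.2f}")
```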
Again, for the same reason, people like looking at a crisp image, then a
short transition, and again a crisp image. But this also shows that
people are not good at noticing perspective distortion.
In this case you can see it's distorted, it changes, and it's
distorted again; people don't seem to mind that, or they don't notice
it.
That's the reason we did the next project, which was to understand this
issue of perspective distortion. Now there's no blending, no
popping, just a single input image and a single proxy; we project the image
onto the proxy, view it from different places, and we should see
perspective distortions. This paper will also be at SIGGRAPH this year,
in a few days.
So, for example, in this case that does not look like a right angle.
This is a single image mapped onto a flat plane. And the other one
looks slightly more acceptable.
What's the limit? At what point will the user object
to it, at what point will the user say, yeah, that does not look like a
right angle? So the two experiments were designed to understand this idea of
distortions. In experiment one we showed still
images of facades with balconies, and we asked the user: what do
you think that angle is?
The way they answered this was using a hinge device. It took us a long
time to figure out how to ask this question, because if you put up
sliders, people make mistakes all the time. So this turned out to
be a nice way: people would see the image and actually replicate the angle
in front of them. We recorded all of this data, and in experiment two
we asked a simple question: does that look like a right angle to you?
People were asked to rate it from 1 to 5. Both experiments
should give us data that matches, and that's exactly what we got. I
won't go into how we deal with all of this data, but the end
result is that we now have a model for predicting the
level of distortion for any novel view position. For example, in this
case on the far right you'll see a view of the scene with the capture
camera and the proxy we used; the two dots over there are
these two edges. And you will notice that the
user is trying to sketch an IBR path over there; red means bad
distortions and blue means acceptable. We can predict the rating of
this path using the data we already have. This is one of the first ways
of predicting how bad the distortion will be without actually running the
application itself. So I'll just let it play: in the beginning you can see
it's quite bad, then it gets acceptable; it's a very big path,
and then it gets to be okay. So what's the use of this? We can use
this prior knowledge to restrict the navigation zone of the user. For
example, in this case we only allow the user to
translate in acceptable regions; you'll get a blink effect where we stop the
navigation at that point and don't allow the user to go any further. And
we can do the same thing for turning. This allows the application to
restrict navigation to acceptable zones. This is something we do in
games a lot: we restrict the movement of players within a fair zone.
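Purely as a speculative sketch of how an application might use such a predictor to restrict navigation: a stand-in distortion score (here just the angular deviation from the capture direction, not the perceptual model from the study) gates each proposed camera move, and moves beyond a threshold are simply refused.

```python
import numpy as np

# Speculative sketch of distortion-limited navigation. The predictor is a
# placeholder; the real perceptual model from the study would replace it.

def predicted_distortion(novel_pos, capture_pos, proxy_point):
    """Crude stand-in score: angle (degrees) between the capture viewing
    direction and the novel viewing direction toward a proxy point."""
    v_capture = proxy_point - capture_pos
    v_novel = proxy_point - novel_pos
    cosang = np.dot(v_capture, v_novel) / (
        np.linalg.norm(v_capture) * np.linalg.norm(v_novel) + 1e-9)
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def try_move(current, proposed, capture_pos, proxy_point, max_score=25.0):
    """Accept the move only if the predicted distortion stays acceptable;
    otherwise keep the camera where it is (the 'stop navigation' effect)."""
    if predicted_distortion(proposed, capture_pos, proxy_point) <= max_score:
        return proposed, True
    return current, False

capture = np.array([0.0, 0.0, 0.0])
proxy = np.array([0.0, 0.0, 10.0])
cam = np.array([0.5, 0.0, 0.0])
for step in range(6):
    cam, ok = try_move(cam, cam + np.array([1.0, 0.0, 0.0]), capture, proxy)
    print(f"step {step}: pos={cam}, accepted={ok}")
```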
So, the conclusions of all of this work. First, we should focus on
free-viewpoint navigation; we have done view interpolation for a long time,
that's something we can handle quite well, and it's about time we jump to the
next level. The second important idea that we want to push forward is that
we are trying to do image-based rendering, which is approximate by definition,
so we can make approximations on top of the data we get from
reconstruction to do something more with it.
Thirdly, because we are using some approximate data such as depth, we
should be using some form of regularization, maybe in the form
of image warps or something else, rather than just directly reprojecting
this data into novel views. Lastly, IBR choices should be based on
perceptual analysis and not on what the first author thinks is good
looking.
This work is actually being used in an immersive space with stereo
in a European project. This is work in progress; we are
porting everything to the CAVE. In the future this will be the
starting application for a European project which will work on
using IBR for the backdrops of games, so all of this code is
already there for the games work to build on.
Open questions in image-based rendering: how do we use machine learning,
how do we exploit all the information we get from machine learning?
There's a lot of work on semantic labeling of images and we are simply
ignoring all of it. Can we use this data to improve results?
Secondly, there are all these nice ideas for image-based rendering
which have been around for quite a while: the view interpolation work by
Zitnick, the work on reflections last year, wide
baseline work, and so on. Is there an approach that can bring these nice
ideas together? I'm not talking about a switch-case statement where
you do this in one case and something else in a different
case, but is there a nice way to combine the advantages of all of these
approaches? Thirdly, IBR has been studied in so many different
contexts that it's hard to compare one context with another.
Light fields, for example, are at one end, where you have hundreds of images,
one object, a very structured capture, and you can do amazing things with
them. Street-level IBR is at the other end: two images for a whole building,
and not much you can do with them. Is there a way to go smoothly from one
end to the other? Then what about artistic effects? We're trying to make
things look good; we can use artistic effects to hide artifacts and
give nice or immersive experiences, and some work is already there in
the form of ambient point clouds, which use nice effects to haze out the whole
background. And every time we write an image-based rendering paper we
always say it's real time, it's all very good. But the real question
to ask is: does this actually work on a mobile device? If the answer is
no, we have room for improving the efficiency of these approaches, to make
them very light; that's where this can really be useful.
And lastly, if all this works, we can use image-based rendering
as a nice tool for virtual reality or games, where
you have to create the backdrops very quickly.
After this: all of that was work from the Ph.D.; right now I'm working
on a project called Halide, a programming language which separates
scheduling from the algorithm in image processing pipelines, so you can
optimize these in different ways and the code does not get entangled. You
don't have to mix your algorithm into the for loops and vectorization
code, so it stays clean; it's being designed to make
life easier for people who write code. This is in collaboration
with a group at MIT. In terms of future research, I would like to
exploit multi-view imagery in detail. People do take lots and lots of
pictures, but right now we're giving them tools to edit only one picture
at a time, which implies that you only want
to keep one of all the pictures you took. At the same time, we
are saying: let's do image-based rendering, which is a nice, innovative
way of visualizing lots of pictures at the same time. So we have to
give people tools to edit whole datasets at the same time, not just one
photograph. And the edits we're talking about can be appearance
manipulations: these are some pictures I took, and if I want to change
how the statue looks, I would like to do it in one image and get it
propagated to all other images automatically.
This sounds reasonably easy for static scenes, but once your scenes become
more dynamic, if the scene geometry changes, if I have a friend in the
picture and I want my edits on my friend to carry over, it becomes a harder
problem, but interesting nonetheless. At the same time we can try to
change the scene itself. For example, here we emptied a metro
station in Paris to take pictures in that station, and the problem was
they didn't like the advertisements; they wanted advertisements of the
1950s. How do we remove advertisements in one image and get that to
propagate over all 50 images that we took over there? And this has to
respect all the parallax effects and everything, so that's a
challenging problem. And what happens if I took very few pictures and
I don't have enough data to propagate from one to the other? Can I use
pictures which other people took to do the same thing? That's the next
level of all of this research. And finally, can we think of these very
large photo collections as unstructured light fields? A light field is a
very structured capture, a small object and hundreds of images, and you can do
amazing things after capture. Can we develop similar functionality for
these very large groups of images, can we think of them as unstructured
light fields? Lastly, I would like to keep learning. I would like to
explore machine learning and vision in more detail, and work on edit
propagation and manipulation; this is something I've wanted to do during my
Ph.D. itself and never had enough time to actually work on
properly. And I like being part of groups which do lots of other
things; a bit of work in systems and stereo is something which I would
love to do, and my latest project is an example where I'm working
with a compilers group to learn more and more things.
With this I'd like to thank you all for your attention. I'd be happy
to answer any questions. [applause]
>>: So all this is right now running in real time?
>> Gaurav Chaurasia: Yes, yes, yes. This is completely real time.
The preprocessing stages, of course, the whole reconstruction
pipeline, are all offline. But the actual rendering runs at 50
frames per second without any problems, because the least-squares
problems are very small and easy to solve, and we do them in parallel and
all kinds of stuff. It's all real time. But as I said, the overhead is
still there: you still have to store something like 15 images to do rendering
off a dataset. How do you optimize this, how do you optimize the amount
of memory and other things? That's what I mean by efficiency.
>>: You did nice work on the analysis of how far I can go from the
images we have. Why not -- have you tried to do the opposite and
guide people, when they capture the images, on where they should be?
>> Gaurav Chaurasia: Yes, this is actually something we've thought
about, but we never got strong ideas. There's a paper from MIT where
they're trying to capture a light field and the system guides the person
where to move the camera, and that's very cool. Ideally you should be able to
extend this into a multi-view environment. In this work we get
guidelines on how many images to capture; it doesn't tell us where to
capture, but if you look at the heat map over there, if you have one
image, that gives you this blue zone, and if you have another image, that
can give you another blue zone. Then, depending on how many
images you want to use, you could make the whole thing completely blue
if you use lots and lots of images. That gives us a guideline on how
many images to use. This is assuming you do a street capture where
there's a car and there's not much you can do; you can't move forward
or backward. But for hand-held images it would be good to explore this
further.
>>: I had another question. When you moved from using a single global
warp to using a warp per superpixel, doesn't that mean
that the superpixels, being warped independently, can introduce lots of
cracks?
>> Gaurav Chaurasia: Yes, it does. That's exactly --
>>: [indiscernible].
>> Gaurav Chaurasia: Exactly. You do get this -- so we expected this
problem to show up. Thankfully, as a consequence of some accident and
some design, this hasn't happened. First of all, because we're using a
shape-preserving warp, the superpixels themselves do not deform that
badly. The way superpixel A deforms is much the same way B deforms:
if they were neighbors in the image at rest, they sort of maintain their
shape in the next image.
>>: Same depth.
>> Gaurav Chaurasia: Yes, but you get superpixels from other images
which don't do this, and they can be used to do some
hole filling in the cracks left in between. That's exactly why we use
multiple images: to fill these small areas and give the parallax effect
as you go along.
>>: It's interesting that you don't notice a lot of either popping or
ghosting in those cracks, I guess because of what you said, the warp is
well behaved enough that --
>> Gaurav Chaurasia: Right. It's consistent enough on neighboring
views.
>>: Similar warps.
>> Gaurav Chaurasia: Yes.
>>: Same thing you would expect if they were at similar depth.
>> Gaurav Chaurasia: If your depth was accurate, then it would always be
pristine, but that's what the warp is supposed to alleviate if you don't have
exactly the same depth. And some of it comes from the blending
heuristics, because in some cases, as I said, if you notice that
you're not supposed to blend these two guys, which one do you favor?
That's where we say: okay, choose the background. A parallax error in
the background is less noticeable than a parallax error in the
foreground. It makes all the difference; a very simple idea, but it
improves your experience by a fair margin.
>> Eyal Ofek: Okay. Very nice work. Thanks.
[applause]