>> Rick Szeliski: Okay. Good afternoon everyone. I see some more Live Log folks
drifting in. So it is my real pleasure to welcome Daniel Martinec here to Microsoft
Research to talk to us about his thesis research. Daniel just finished his Ph.D. at the
Czech Technical University in Prague, which is where the ECCV conference was in --
what was it, 2004? Right. And that's where I started taking some of the first photos that
became Photosynth later.
Daniel was also -- while he was a student -- a participant on one of the teams
that was working on one of these 3D reconstruction challenges that I had run for the
ICCV conference. And most recently, since his graduation in the summer, he started
working for Microsoft. He's going to be joining the 3D team in Boulder at Vexcel
working on commercial versions of Photosynth.
>> Daniel Martinec: Thank you. So what you can see here is a result of the automatic
pipeline which we developed in Prague.
So this presentation is my Ph.D. [inaudible]. My thesis was led by Tomas Pajdla and
Mirko Navara at the beginning. All of this work was done at the Center for
Machine Perception at the Czech Technical University in Prague.
I will give a brief introduction to 3D reconstruction from multiple views, and I'll explain
why it is challenging. I'll show some work of others and how it differs from our
approach. After I formalize the problem, I will explain a few contributions of the thesis,
some of them in more detail, and at the end I would like to tell you about my nice new
results.
So we have some images as input, which can be entirely unorganized. And the task is
to calibrate the cameras and reconstruct what we can see in the images. Under general
assumptions we don't know the positions and orientations of the cameras, but once they
are computed, it is quite easy to obtain reconstructions like this, if they are computed
accurately enough.
And, moreover, we don't know the focal lengths, but we assume that the other internal
camera parameters are known: the principal point is in the image center and we have
square pixels. These kinds of images can be found, for example, on Flickr.
This task is challenging because there can be many images, like thousands or hundreds of
thousands. This example has more than 2,000 images. There can be many
occlusions, almost 100 percent, like in this case. There can be strong perspective effects,
so a simple affine camera model cannot be used.
There can be various configurations of the cameras, like narrow or wide baseline images,
and some images can be panoramas. Images can be either entirely unorganized or they
can be in some sequences.
And there is another challenge, which is caused by repetitive structures and similar
objects in the images. This means there can be many mismatches which nevertheless
satisfy the two-view geometry. So, for example, these four points happen to be quite
similar to these four points, but they just accidentally lie on the corresponding epipolar
lines. These are actually different buildings.
And the situation can be even worse. In this image pair there are something like 1,000
inliers to some epipolar geometry, but there is no true match, because these are different
buildings. So this building lies on this side in this image. And of course, to obtain a
consistent reconstruction of the whole environment, such wrong geometries have to be
rejected.
So the whole pipeline starts with some images. We detect some regions of interest and
match the similar ones. Using RANSAC, we obtain some two-view geometry,
which can look in 3D like this, and then we try to combine somehow such pair-wise
reconstructions into a multiview reconstruction.
Once that's done, it's possible to identify some image pairs which are suitable for dense
stereo. This is how you [inaudible] with narrow baseline. So we can obtain quite
full disparity maps which correspond to point clouds in
space. And we simply merge these point clouds into a point cloud which can be
further approximated locally by [inaudible]. And here we show just the principal planes
of the [inaudible] covered with texture.
Here we have millions, tens of millions of points, and here we have just a few hundred.
So we call them fish scales. But one can go even further. This is the result of my master's
student [inaudible]. He even found surfaces and used many depth maps so that it's more
consistent and more complete.
My studies were mainly about this step from two-view geometries to a multiview
geometry, how to combine them. There have been a number of research groups which have
dealt with this problem. On video sequences there were projects, for example, at the
University of Grants in Bexsell [phonetic]. A realtime 3D reconstruction was introduced
two years ago. For unorganized images there were also some universities working on
this theme, like Oxford and the lab in Washington.
And there are two ways to approach the problem of unorganized images. First is the
sequential one; however, it is quite prone to local minima. The other is factorization,
which is, on the other hand, not robust to occlusions. I will speak about this later. And
there is another approach, which I call the batch approach, introduced by Guilbert, but it's
usable for affine cameras only.
So what we did, we wanted to develop some methods for automatic 3D
reconstruction which are general and can be applied in the same way to both
organized and unorganized images. We try to use all the data at the same time. We
started our studies with factorization; however, as I said, there's a problem with
occlusions, so it's applicable to a few images only.
Jacobs generalized factorization so that it can be used for occlusions too, but, I'll speak on
this later, there are some problems with that. And our methods, most of our methods
which we developed, we call gluing of partial reconstructions. And this is not only
robust to occlusions but also to mismatches, as you will see.
So we assume that we have some measurements in images. Here we use homogeneous
coordinates because in the same way you can formalize the problem even for
omnidirectional cameras. So we want to compute some point in space, again in
homogeneous coordinates, and a camera, so that the point is projected by the camera onto
the image measurement up to some unknown depth.
All the blue things here are unknown; the only thing known is the image
measurement. So it's possible to stick all such projection equations into one larger
equation, where it's easy to obtain these [inaudible] projective depths from epipolar
geometry. If this matrix is not complete, these crosses stand for missing elements, which
are mainly due to occlusions.
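To make the stacked equation concrete, it has roughly this standard form (my notation, not necessarily the slide's):

\[
\lambda_{ij}\,\mathbf{x}_{ij} = P_i X_j,
\qquad
\underbrace{\begin{bmatrix}
\lambda_{11}\mathbf{x}_{11} & \cdots & \lambda_{1n}\mathbf{x}_{1n}\\
\vdots & & \vdots\\
\lambda_{m1}\mathbf{x}_{m1} & \cdots & \lambda_{mn}\mathbf{x}_{mn}
\end{bmatrix}}_{\text{rescaled measurement matrix, } 3m \times n}
=
\begin{bmatrix} P_1\\ \vdots\\ P_m \end{bmatrix}
\begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}
\]

where \(\mathbf{x}_{ij}\) is the homogeneous image measurement of point \(j\) in camera \(i\), \(\lambda_{ij}\) its projective depth, \(P_i\) a \(3\times 4\) camera, and \(X_j\) a homogeneous 4-vector, so the complete left-hand matrix has rank at most 4.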
So if this matrix is complete, it's very easy to factorize it into a product of cameras and
points using singular value decomposition. This is the result of Tomasi and Kanade and
many other people who use this approach.
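As a rough illustration of that step, here is a minimal sketch of rank-4 factorization of a complete, already-rescaled measurement matrix (variable names are mine, not from the talk):

```python
import numpy as np

def factorize_complete(W):
    """Factorize a complete rescaled measurement matrix W (3m x n) into
    stacked cameras P (3m x 4) and homogeneous points X (4 x n),
    in the Tomasi-Kanade style via a rank-4 SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    S4 = np.diag(np.sqrt(s[:4]))   # best rank-4 approximation, split between factors
    P = U[:, :4] @ S4              # stacked camera matrices
    X = S4 @ Vt[:4, :]             # homogeneous 3D points
    return P, X                    # both only up to a common 4x4 projective transform
```

The recovered cameras and points are defined only up to a 4-by-4 projective ambiguity, which is removed later by the metric upgrade.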
However, if there is an occlusion, you cannot do this. So people tried to identify some
large submatrix of this big matrix, do the factorization, and then extend the solution to
other parts of the matrix. This approach is called imputation, because you are
filling some holes in this matrix.
But the problem is that by filling, you introduce some error into this matrix. And this
error comes from just a few measurements; it's very local. And, moreover, when
you are extending this solution, you are using this error again and again. So it's
highly dependent on where you start. And so for difficult problems with many, many
occlusions, this really doesn't work.
There is another way: people tried to initialize these two matrices at random and then use,
for example, power factorization or [inaudible] to alternate between filling and
factorizing again. However, it doesn't really work for difficult data.
So what we do instead, we just look into our pair-wise reconstructions, or into
the cameras. Usually we know the internal parameters, or we can estimate them based on
pair-wise image measurements. And so it's possible to decompose each camera into a
product of this internal parameter matrix, a rotation, and a translation. And I will come
back to this later.
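In symbols, this is the standard decomposition (nothing specific to the talk); with the principal point at the image center and square pixels, only the focal length \(f_i\) is unknown in the calibration matrix:

\[
P_i \;\simeq\; K_i \,[\, R_i \mid t_i \,],
\qquad
K_i = \begin{bmatrix} f_i & 0 & 0\\ 0 & f_i & 0\\ 0 & 0 & 1 \end{bmatrix}.
\]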
So we started our research with projective factorization. We combined perspective
cameras with the handling of occlusions by Jacobs. Later we extended it to omnidirectional
cameras. And then I was playing for a while with the famous Dinosaur sequence, which has
only 36 images. And it was really a problem; I couldn't do it using Jacobs' method. The
problem with that method -- it's a very nice [inaudible] way -- is that Jacobs just takes
small sets of columns of the measurement matrix, and when there are missing elements,
some big subspaces are generated by these samples, and he intersects these subspaces.
However, he does it in a way that he goes to the complementary subspaces, makes a
union there, and goes back. This is the simple De Morgan rule. However, by going forth
and back, he loses the connection with the image measurements. And when I found this,
the problem was almost solved, because it was then sufficient to reformulate the problem
in the original subspaces. This is what we did, and it works amazingly well even when
everything is only projective, without camera calibration.
Then I studied further and came up with a method which doesn't even need these projective
depths. Then there was the ICCV contest, and there were panoramas, so I made some
detection of these panoramas. And later, at ICCV '05, Fredrik Kahl's method appeared,
which can just take absolute rotations, which I could already compute -- I will show you
how later -- and it casts a minimization problem using second-order cone programming, so
the translations and points can be estimated almost for free. And then I spent some time
on making things more robust, which you will also see.
So this is what I personally consider my biggest result, and I will spend a
few slides on these two remarkable results.
So, as I said, we start with some pair-wise reconstructions. To make things simple, we
forget about image data for now and keep just the camera viewing directions. And,
further, we forget about translation for a while, so we have just relative rotations.
And we want to register such relative rotations into one coordinate frame, or reference
frame. And how do we do it? Each such pair of viewing directions has to be somehow
rotated so that it aligns well in the reference coordinate frame.
And this can be very easily described using 3-by-3 rotation matrices, so you can see
what's actually happening there. And we end up with such equations, which are linear.
This is a great thing. There are a lot of unknowns. The matrices on the right
side I call consistent rotations. These are the absolute rotations; this is what we want to
estimate. And these matrices are mappings between the local coordinate frame of the
pair-wise reconstruction and the global coordinate frame.
So it has a nice geometrical meaning. And, moreover, it's very easy to solve. We used
just an eigenvalue solver. The problem has a global minimum, which is nice. The problem
is very sparse and very well conditioned, because all the matrices there are orthonormal
matrices -- the nicest matrices. And it's also very fast: a fraction of a second, even if we
have thousands of image pairs.
And, of course, the resulting matrices are not exactly rotations, but they are very
close to them. So it's sufficient to project these matrices to the space of [inaudible]
matrices, or to use SVD and just set all the singular values to one.
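A minimal sketch of this rotation registration as I understand it (the function and variable names are mine; it stacks the linear constraints R_j - R_ij R_i = 0, takes the three eigenvectors with the smallest eigenvalues, and projects each 3-by-3 block back to a rotation by setting its singular values to one):

```python
import numpy as np

def register_rotations(pair_rotations, n_cams):
    """pair_rotations: dict {(i, j): R_ij} with the convention R_j ~ R_ij @ R_i.
    Returns absolute rotations R_0..R_{n-1}, up to one global rotation (gauge)."""
    rows = []
    for (i, j), R_ij in pair_rotations.items():
        row = np.zeros((3, 3 * n_cams))      # one block row per pair: R_j - R_ij R_i = 0
        row[:, 3 * j:3 * j + 3] = np.eye(3)
        row[:, 3 * i:3 * i + 3] = -R_ij
        rows.append(row)
    A = np.vstack(rows)                      # very sparse in practice
    # The stacked columns of the absolute rotations lie near the null space of A,
    # so take the eigenvectors of A^T A with the three smallest eigenvalues.
    _, V = np.linalg.eigh(A.T @ A)
    C = V[:, :3]                             # (3 * n_cams) x 3
    Rs = []
    for i in range(n_cams):
        M = C[3 * i:3 * i + 3, :]            # approximate rotation, not exactly orthonormal
        U, _, Vt = np.linalg.svd(M)
        R = U @ Vt                           # project: set all singular values to one
        if np.linalg.det(R) < 0:             # keep proper rotations (det = +1)
            R = -R
        Rs.append(R)
    return Rs
```

For large problems one would use a sparse matrix and a sparse eigensolver instead of the dense calls above; the structure of the computation stays the same.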
So this is the result of registering the rotations into one coordinate frame. You can see that,
for example, this camera number 22 is there several times, because the translations
have not been registered yet. But after calling Kahl's method, you can see that the problem
is almost solved.
Here we have reprojection errors. The minimal error is below one pixel and the
maximum is below 20 pixels. You may say it's quite a lot; I will arrive at a new
method which doesn't have such a problem. But still, for a linear method, it's quite a good
result and it's very fast. Rotations are done in a fraction of a second and translations in a
few seconds.
>> Rick Szeliski: Daniel, if I remember, some of the techniques -- I don't remember if
David Nister is one of the people who worked on this -- when you know the rotations, there
are solutions for the translations, but points far away, points near infinity, often bias things
and make them not work very well. Is that correct, points at infinity have problems?
Or is it only if you try to do sort of a linear version of the problem?
>> Daniel Martinec: No. This is second-order cone programming.
>> Rick Szeliski: Second-order cone programming takes care of that because --
>> Daniel Martinec: Yeah.
>> Rick Szeliski: [inaudible]
>>: Okay. Okay.
>> Daniel Martinec: There's no -- I have met no problem with this method, except that it
relies on point measurements.
>> Rick Szeliski: Okay. So it doesn't matter that you're also seeing points at infinity
[inaudible].
>> Daniel Martinec: Yeah. And this is the result after the final bundle adjustment. And you
can see that nothing really changed, which shows that here we were very close to
the local minimum. Of course, we don't guarantee a global minimum -- the problem is
nonconvex -- but our reprojection error now is below two pixels,
which is, I think, a nice solution.
And also you could see that there is a big point cloud which moved, and it's due to points
visible in these two images, which are in fact a panorama. It's just a closeup. And it
didn't really matter to the method whether they were panoramas or not. It also
doesn't matter if the points are on some dominant planes. This is really robust to all
these things.
This is another example of almost 300 images from a mountain scene. This is another
shot. This is an example of the two images. So a few words on mismatch identification.
As I said, in this image pair, RANSAC on epipolar geometry identifies these mismatches
as inliers. How to --
>> Rick Szeliski: Can I ask a question about the previous approach? So you assume that
you have enough rotations estimated pair-wise that you can figure out a globally
consistent set of rotations. When you have very wide field of view cameras, you can
often reliably estimate rotations. As your field of view gets narrower, isn't there sort of
ambiguity between rotation and the translation that makes it hard to get good pair-wise
estimates?
>> Daniel Martinec: Yes. So, for example, these two cameras --
>> Rick Szeliski: Right.
>> Daniel Martinec: -- translation, in fact, is not determined there at all.
>> Rick Szeliski: Right. Those two cameras, because they were taken from the same
point of view. I'm saying, basically -- consider taking out your 35 millimeter
camera or your point-and-shoot, and the way all these cameras power up is they're in
wide-angle mode. If you force yourself to shoot the world with a 100 millimeter lens,
would the pair-wise rotation estimates just basically be poorly conditioned? Or if you
were flying in an airplane shooting aerial photos. In other words, if the
images are less perspective, can you estimate rotations accurately pair-wise?
>> Daniel Martinec: Well, I have not worked with such a camera.
>> Rick Szeliski: Okay.
>> Daniel Martinec: Yet. A strong assumption of this approach is that these [inaudible]
orientations are quite well estimated. But it works somehow even if there is an error of 100
degrees, which I have seen on the mountain scene sequence.
>> Rick Szeliski: Okay. So maybe because you're doing a [inaudible] least squares,
those errors [inaudible].
>> Daniel Martinec: Yes. It's a big least squares, and each of the terms has almost equal
weight. So even if one is really off, there are enough others, and they can make it
better.
>> Rick Szeliski: Okay.
>> Daniel Martinec: But -- yeah.
So I've made an observation that if there is a mismatch, it's usually far from the other --
sorry, far from the other point correspondences. And that is both in the image and
in depth. So I made a very simple heuristic: I took all my measurements in the image
multiplied by their depths, fitted a Gaussian to them, and removed the
25 percent of the points which are furthest from the Gaussian
center. And it turned out that all such bad guys disappeared this way.
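A rough sketch of this heuristic as I read it (assuming the measurements are image coordinates scaled by their normalized projective depths; the names and the exact distance measure are my choices):

```python
import numpy as np

def prune_by_gaussian(points_2d, depths, keep_fraction=0.75):
    """points_2d: (N, 2) image measurements; depths: (N,) projective depths.
    Fit one Gaussian to the depth-weighted measurements and drop the points
    farthest from its center (the suspected mismatches)."""
    d = depths / np.mean(depths)              # normalize so the mean depth is 1
    X = points_2d * d[:, None]                # depth-weighted measurements
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-9 * np.eye(2)
    diff = X - mu
    dist = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)  # Mahalanobis
    order = np.argsort(dist)                  # closest to the Gaussian center first
    keep = order[: int(keep_fraction * len(X))]
    return np.sort(keep)                      # indices of surviving correspondences
```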
Well, there are more comments on this in the paper and in the thesis. I know it's a heuristic,
but on the other hand, if you have only two such images, it is
really difficult to find out which points are mismatches and which are not, if you don't
have any other information [inaudible].
>> Rick Szeliski: But if you're weighting points by depth -- these scenes happen to be
closely cropped, so you don't see a lot of the background, right, but if you had
shots where you saw the mountains in the distance and things like
that, and there was a dominant object, then, you know, things far away, if you really are
weighting them by depth, will really --
>> Daniel Martinec: Yeah, yeah. Of course, you have to normalize all your depths so
that the mean depth is one or something like that. It's kind of a projective approach. It
works even if your cameras are not calibrated. So it's very general. It works for
omnidirectional cameras. And especially on that mountain scene with those
kilometer-distance objects it works.
>> Rick Szeliski: Okay. So it's more in a projective framework, basically if you take the
homogenous vectors, turn them into one norm, then it's meaningful to talk about the
center of mass.
>> Daniel Martinec: Yes, exactly. Yes. Of course, this way it happens that I remove
some good data too. But I don't mind -- I have enough data, and 75 percent of the data is
still there. And this image pair was the worst case: it had the largest amount of
mismatches, 25 percent. I haven't met any other data like that.
So these are the inliers which survived this test. And by removing these mismatches, our
reprojection error after translation estimation went down from 100 pixels to 22.
But we can do even better. It's possible to pick out only four points among these
inlier estimates so that they represent the [inaudible] geometry almost as well as all these
points.
And this is very simple, because the camera matrix has just four columns, which
means that when it projects the points, only a subspace of dimension 4 is generated. So
the only thing you need is four linearly independent columns. Well, and of course the
question is how to pick these four different, or independent, points. We use the same
technique as for identifying mismatches: we fit a Gaussian, pick the point which
is most different from the others, most distant from the center of mass, and repeat this
three more times, on a 3-dimensional subspace and a 2-dimensional subspace, because once
we have one point, we can keep just the remaining data which is not explained by that
dimension.
And by this we get a speedup of a factor of 2,000. Well, so instead of spending four hours on
translation estimation using those tens of thousands of points, we use just a few hundred
points, and it takes a few seconds with really similar results, within half a
pixel or so.
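A sketch of that greedy selection as I read it (pick the point farthest from the center of mass, remove the direction it explains, repeat; all names are mine):

```python
import numpy as np

def pick_representative_points(X, k=4):
    """X: (N, d) point coordinates of one pair-wise reconstruction, d >= k.
    Greedily pick k roughly independent points, farthest-from-center first."""
    idx = []
    R = X - X.mean(axis=0)                   # work relative to the center of mass
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))            # most distant in the remaining subspace
        idx.append(i)
        v = R[i] / (norms[i] + 1e-12)
        R = R - np.outer(R @ v, v)           # drop the direction already explained
    return idx
```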
>>: Do you think this is for applying Kahl's technique, right --
>> Daniel Martinec: Yes.
>>: -- for the translation? But what if the points you pick aren't present in other images?
When you need to find points that are also there -- or that's right. Kahl's technique is only
pair-wise, right?
>> Daniel Martinec: Yes. Everything so far was just pair-wise, which means
that you can have really bad data. It works even if there is no point visible in
three images.
>>: [inaudible] second-order cone program is slow?
>> Daniel Martinec: Yes. Yes. Exactly. It's very slow.
>> Rick Szeliski: There's something there. [inaudible] says that you need three-way
overlaps for reconstruction to be consistent, right? Or is that -- there's enough sort of
things looking at each other from different views that there's only one sort of consistent
solution, right? If you walk down a street and you never see any point in more than two
images, it's going to be hard to figure out exactly how fast you're walking.
>> Daniel Martinec: Yes. Of course.
>> Rick Szeliski: But if we're all looking at the same Dinosaur, then you don't really
necessarily need to see more than three images.
>> Daniel Martinec: Yes. If your camera moves along a straight path and you are using
only pair-wise data, then of course you cannot estimate the scales.
>> Rick Szeliski: Right.
>> Daniel Martinec: This never happened in my data sets. And of course you can plug
the points in three images, and I use them in a number of images in the final bundle
adjustment.
But it's better to use them as soon as possible, as early as possible in the optimization.
And in my new stuff, they are there from the very beginning.
So this is the scene which you've seen at the beginning. And here I can show you the
result of Photosynth. No, not this one. So it's split into three components, which doesn't
say anything about the multiview reconstruction; it says just that I have better matching so far.
So this is another example. There were several such epipolar geometries which are
wrong. You can see that this window is slightly rotated here, or it's entirely another
window; it just happened that all of these tiles on the roof look similar, so it's really difficult
to find out whether it's a true match or not.
But we used all the pair-wise reconstructions and we obtained a maximum error of more
than 100 pixels, and we used a simple technique: we identified the pair-wise reconstructions
in which the reprojection errors were the highest, removed these, and repeated that a
few times. We arrived at a reprojection error of seven pixels here, and you can see that
the reconstruction is really consistent. The surfaces visible by different cameras fluently
go one into another.
So this is the most difficult example. Here you can see that we kept removing epipolar
geometries, but then it started to oscillate at about 30 pixels. I think this is a really
difficult scene, and I think the method for identifying nonexistent epipolar geometries has
to be more sophisticated than simply relying on least squares being robust enough
if you have good conditioning.
So I've spent some time on this research, but then my attention was again given
to enhancing precision. So this is another example -- a scene with more than 2,000
images. It was a paper model made a few hundred years ago; it's a part of a model of
Prague with 6,000 buildings. And it was a very nice project; however, it was canceled, so
this is the only data which I have from it.
This linear technique worked on it the same way as on the other scenes, so I'm
pretty sure that it would work even on tens of thousands of images and maybe more. But
unfortunately I didn't have any larger data yet. And, moreover, I could not do
the dense reconstruction using my software, because I didn't want to spend time
rewriting it and the depth maps just didn't fit into 16 gigabytes of memory. And then the
project was canceled.
So I've shown some techniques on projective and metric gluing, which we developed,
which are quite accurate and relatively fast and robust. Of course, I had to write all the
software, or some of the software was from others. But, for example, I found some
heuristics for speeding up matching image pairs. I made some relative pose estimation
when the focal lengths are entirely unknown.
I worked also on line reconstruction. The software is part of a multicamera self-calibration
package which is widely used in the world. And I wonder that people still use it, because
my technique in that package is six years old and there was never a need to
replace it with newer, more robust techniques.
The software was sold to a Canadian company last year, and it's used not only by people
at our university but by another university too. It was well accepted by the vision
community. We ended up second at the ICCV '05 contest. We published quite a
few papers on it.
So this is everything from the stuff which I published in my thesis, but there
is some new stuff too. We've touched on some problems there already, and I
identified two problems. The first is that the rotation representation is not good in that
linear estimate, because what I get are only approximate rotations, not true rotations,
because rotations satisfy some nonlinear constraints. And thanks to that, some errors can
be inherited from some pair-wise reconstructions, and the translations can produce larger
errors, like 50 or 100 pixels, which may sometimes converge to a nice solution, but
sometimes not. And I wanted to make it more robust.
So my new approach has no approximation, no linear estimation of absolute rotations
all at once. It doesn't even use second-order cone programming. It simply takes the
pair-wise reconstructions which we have and tries to modify them slightly so they are
more consistent with each other. And when they are consistent, the problem is solved,
because then it's very, very simple to chain these consistent
reconstructions, which are consistent in rotation and translation and scale too.
And the solution has some very nice properties, because we have low reprojection errors
during the whole optimization process. Each of these pair-wise reconstructions has some
image points. I use four points, which you've seen, but you can use any number of points
if you want. And so on these pair-wise reconstructions the problem is solved. However,
it's inconsistent with the other reconstructions.
So the only thing which is needed is to add some penalty term which penalizes
constraints which are on top of these pair-wise reconstructions, and this is something
which has to be solved. So I think that by keeping these reconstructions consistent with
the data all the time, within a reprojection error of let's say 5 to 10 pixels, we can avoid
some really bad local minima.
So we get better accuracy. It's also scalable. I've tested it on 300 views only, but
as the old stuff worked on thousands, I may try it on
thousands or even more. And there are some things which are needed in the project I work
on now, Geosen's [phonetic]. They need some stuff like putting priors on the
cameras, and I think this is a really natural way to do it. We have these pair-wise
reconstructions, and if you have some idea of how the cameras are rotated or translated, I
think it is very easy to add them as data terms.
Well, so this is not finished, so I can show you some equations from my paper. So this is a
relative pose between views I and J. This is the relative rotation, and
this is the relative translation. Well, and this is the composed pose, which is just a
chain of such relative poses. So, for example, for these images, we just chain these
two and arrive at something like this. So this is very simple.
And I want my relative pose to be consistent, to be the same as some chain of relative
poses along some cycle, either a triangle or a cycle of larger length.
And so here I have a theorem that what I want is something which has to be
satisfied in the final reconstruction. And so the question is how to enforce it. Well,
it's quite simple to see that this rotation should be the same as the composed
rotation; the relative rotations are just multiplied. It's slightly more complicated
with translations.
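Written out, the cycle consistency being described looks roughly like this (my notation; the slides may differ). With the relative pose \((R_{ij}, t_{ij})\) mapping view \(i\) to view \(j\), composition over \(i \to j \to k\) gives

\[
R_{ik} \;\approx\; R_{jk}\,R_{ij},
\qquad
t_{ik} \;\approx\; s\,\bigl(R_{jk}\,t_{ij}\bigr) + t_{jk},
\]

and over a closed cycle the composed rotation should return to the identity, \(R_{ki} R_{jk} R_{ij} \approx I\). The per-pair scale \(s\) of the translations is what makes the translation part harder than the rotation part, as noted next.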
But if you want to minimize, in the penalty term, the difference
between the left- and the right-hand side of this equation, there is a problem. And it is the
scale in this translation, thanks to which these terms have different weights than those
terms.
So it turned out that it's much better to minimize the difference in essential matrices.
This is one way. Another way is to minimize the reprojection error using composed
camera matrices, which is the camera matrix of image J -- these are the internal
parameters -- and instead of the relative rotation I use the composed relative rotation.
So at the end we arrive at two formulations: one using these essential matrices and
one using the reprojection errors. So this first term is the standard bundle adjustment;
this is just the reprojection error. Well, the reprojection error is defined so that you throw
into it some image points, image correspondences, and some cameras. This is a pair of
cameras: one of the cameras is fixed and the other is represented by the relative rotation
and translation. And these are my points, either four or more, and this is the relative
rotation. These are just the squares of the errors, and the sum says that I sum over all
images in that pair-wise reconstruction, and over columns, or points. And I sum
over all pair-wise reconstructions. So this is simple. And this is the term which is the
[inaudible] function, which should be zero at the end. Of course, at the beginning it is
not, usually, because RANSAC doesn't know about the other data; it relies only on what is
in the two images. So this term is nonzero, and it turns out that it is sufficient to initialize
its weight at something like 0.001 -- this weight is
1,000 times smaller than 1.
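Schematically, the objective being described is something like this (my transcription, not the exact formula from the slides):

\[
\min \;
\sum_{(i,j)} \sum_{p}
\bigl\| \pi\bigl(K_j [R_{ij} \mid t_{ij}],\, X^{ij}_p\bigr) - x^{j}_p \bigr\|^2
\;+\;
\sum_{t} \omega_t \,
\bigl\| E(R_{ik}, t_{ik}) - E(R_{jk} R_{ij},\, R_{jk} t_{ij} + t_{jk}) \bigr\|^2,
\]

where the first term is the per-pair reprojection error (each pair-wise reconstruction keeps its own few points \(X^{ij}_p\)), \(E(R,t) = [t]_\times R\) builds an essential matrix from a relative pose, and the weights \(\omega_t\) start around \(10^{-3}\) and are raised gradually, in the spirit of a continuation method, as the next question brings up.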
>>: You're doing the continuation method where the right-hand term initially is small
[inaudible].
>> Daniel Martinec: Yes. Exactly. And I gradually want it to be more consistent. But if
I push too much, it's possible that this term gets higher reprojection errors, and this is
what I don't want. So I have to be quite careful about this.
And another possibility is to minimize exactly the reprojection error but on the composed
camera, which, instead of the relative rotation, has the composed rotation. So this may be
closer to -- and there is also some weight.
And that's all. The only problem is how to minimize this function. And it turns out that
it's possible to either have one weight, which is shared by all these triplets -- or however
long the cycles you want to have are -- and to be careful not to push too much. And
it works very well. I can show you some --
>> Rick Szeliski: Are you going to say a little bit later about having per-triangle
weights? Because there's an index there, right, so it suggests that you have a different
weight for each T.
>> Daniel Martinec: Yes, it can be different. It's up to you. I have it here with different
indices because -- because it's possible to solve this problem locally only. Imagine we
have some network of our relative -- of our -- well, these are -- each line is an epipolar
geometry.
>> Rick Szeliski: Right.
>> Daniel Martinec: And let's say in this triangle that penalty term is the highest.
>>: Okay.
>> Daniel Martinec: And it's possible to just take these few neighboring triangles, so this
is this one, this one, yeah, and that's it. And a few more triangles which neighbor these,
which are this one and this one, and to optimize only this part. These three epipolar
geometries, or these others, can be modified too, rotations and translations, but these are
kept fixed.
>> Rick Szeliski: Okay.
>> Daniel Martinec: Which means that we lower some residuals here --
>> Rick Szeliski: Right.
>> Daniel Martinec: -- but stay consistent with the rest of the world.
>> Rick Szeliski: Oh, so you use the omega IJK to basically freeze parts of the
reconstruction and do subsets.
>> Daniel Martinec: Yes.
>> Rick Szeliski: Okay. So -- well, why don't you finish your talk. I have a suggestion.
>> Daniel Martinec: Yeah, that's all. That's all. It's not finished. Also I couldn't finish
the slides --
>> Rick Szeliski: [inaudible] results.
>> Daniel Martinec: Sorry, some results, yes.
So I can show you the result of Photosynth again. So you can see that there is really
some -- oh. I don't know if you can see it. There is some problem that some of the walls
in this right-hand side are not perpendicular or parallel with each other. So in my
method, this is -- I would say this is correct.
>>: Your method produces nicer looking colors.
>> Daniel Martinec: Thank you. This is one tenth of all dense points generated by dense
stereo in our pipeline. And these dense points have the nice property that they can really
reveal the inconsistency, which I need, because I'm trying [inaudible] algorithms which
are highly, highly accurate.
And here you can see that this part of the roof is there twice, maybe more times. This is
special data which I want to work on, because there are not many images around. There
is another image set with this house, but it's from summer. That's why there are more
features on the ground and it's very nicely reconstructed on the Internet. But this one
is winter. And so on this scene I really realized that I have to use triplet-wise
correspondences, and I'm using them in this framework.
It's really easy to express triplet-wise reconstructions using relative rotations and
translations only. And it is really equivalent to having triplet-wise partial reconstructions.
This is something which we discussed in the last couple of days.
>>: So how important do you think it is to get the matches versus -- so [inaudible] my
experience that I've had, so I've looked at the summer dataset of that a lot, and if the
[inaudible] does not close the loop, clearly you have [inaudible]. So if you miss a match,
typically what's happening in some of the datasets is that it miss -- you know, we were
using SIFT style matches, and so it just won't -- it won't do that. You're using a
[inaudible] variance style of matching.
And so the general thing that we've seen is that if we get the matches, we typically don't
have too much problem of the loop opening. And so if the match were closed, that
would -- which we were able to do when we kind of raised the resolution when we're
running that, then the optimization tended to have no problem [inaudible].
>> Daniel Martinec: Yeah. Of course, if I don't have the loop closed, I cannot do
anything with it. And if I have the data and it's weakly conditioned -- in some
of my datasets I have just eight inliers, and I'm happy with them; I can make use of them
and I can really make it consistent on all of these points.
So, of course, the more images you have, the better. But without the loop
closures you cannot expect that this would be somehow better. This is just a standard
bundle adjustment, nothing more. Maybe it would [inaudible] local minima. But I
need the hidden data too.
>>: So in your work, can you force it so they both use the same matches?
>> Daniel Martinec: No, no. Yes, of course. That's another point. Yeah. The pipelines
are different. We have many nice features like [inaudible] and many types of MSERs,
and we use local affine frames and DCT coefficients for descriptors and SIFT features
too.
I found out on the mountain scene that the SIFT features are not really good, because
some of the pair-wise reconstructions have as low as 2 percent of inliers, because
everything is stone. Well, so MSERs with local affine frames work much better than
SIFTs. That's why for identifying reliable epipolar geometries I use MSERs only; in
[inaudible] I use SIFTs for identifying the data, which may help. But I don't rely on
SIFTs.
>>: Just to clarify, are you saying that you take multiple subsets of views corresponding
to multiple bundles, and that's what the cost function at 14 is all about, and then you add
consistency terms between them, and then you sort of pump up [inaudible] until --
>> Daniel Martinec: Yes.
>>: -- it goes to infinity and then they have to be consistent, you can just roll them out
[inaudible] consistent up to [inaudible].
>> Daniel Martinec: Yeah, I hoped originally that there could be just one weight, which
is shared by all the closures, all the loops. And I thought, okay, by just raising it by small
amounts I must arrive at something which is really consistent.
But unfortunately this is not true. I think it's because this is just least squares and it has
many, many local minima. So then I arrived at this local method which
just tries to repair the highest residuals. So I have a graph like this where each -- here I
have triangles. And at the beginning they all have large residuals, like this. And
so I identify the triangle which has the highest residual, like this one, and do something
on it, three steps [inaudible], so that the residual gets, I don't know, to 50 percent or 70
percent, or however much you want, depending on how fast you want to be. And together
with that some neighboring residuals go down too.
So now I've improved the consistency between the data. And now I identify again and
improve it. And at the end I have something like this. So first I start with these essential
closure constraints, because they are applicable even when your points are behind the
composed cameras. Because this can happen: the relative rotations
can have an error like a hundred degrees, they can be a hundred degrees away, and it's
very easy that some of the points are just behind the camera. That's why you cannot use
this formulation -- it's nonsense if your [inaudible] point is not in front of the
camera. That's why first I do a few steps using this part, and then I switch
to this, which really is the reprojection error. That's nothing else.
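In pseudocode, the local repair loop described here might look as follows (a sketch under my reading of the talk; `optimize_neighborhood` stands for the local bundle over one triangle and its neighbors and is not a function from the speaker's software):

```python
def relax_cycles(triangles, residual, optimize_neighborhood,
                 target_drop=0.5, tol=1e-6, max_iters=1000):
    """triangles: camera triplets forming cycles in the epipolar-geometry graph.
    residual(t): current closure penalty of triangle t.
    optimize_neighborhood(t, target): locally re-optimize the relative poses of t
    and its neighboring triangles while keeping the rest of the graph fixed."""
    for _ in range(max_iters):
        worst = max(triangles, key=residual)     # triangle with the highest residual
        r0 = residual(worst)
        if r0 < tol:                             # everything is (numerically) consistent
            break
        # Lower this residual to roughly target_drop of its value; neighboring
        # triangles may move too, but the rest of the graph is kept fixed, so it
        # stays consistent with what has already been achieved.
        optimize_neighborhood(worst, target=target_drop * r0)
    return triangles
```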
And at the end, of course, it's not entirely consistent due to image noise. So
I start with some triangle, make the 3D reconstruction of it, and add some other
triangles. And usually I don't have to bundle again; usually everything is below two
pixels. Only when it grows to, say, 5 to 10 pixels, usually not more -- only
sometimes when there are panoramas, and then the translations. So I don't know exactly.
So I can just grow the solution. And this is something like your seed growing -- how is it
called? Yeah. But now I have everything preregistered, very, very tightly, so --
>>: So once you've got the [inaudible] did this bias even grow [inaudible] consistency
[inaudible] groups that are created every time you add a view, would this do as well as
[inaudible].
>> Daniel Martinec: I've tried to do something similar. I've tried to grow like two
components which somehow overlap somewhere, or maybe more components, and tried
to enforce similar constraints like this on these shared cameras. Well, there may be other
constraints; I haven't tried too much of that. But it turned out that once you put this big
set of cameras into one consistent system, they are very, very tight and it's really
difficult to do anything with them. So it seems to me that it's much
better to have everything just pair-wise, make everything consistent and just grow, than
doing the growing here.
I don't know. I really had problems with this. It was just inconsistent and it
just didn't want to come together on these overlapping cameras. Which is the same case as
you mentioned -- sorry, no, this is a similar case.
>> Rick Szeliski: That's the loop closure.
>> Daniel Martinec: Yes. I'm really skeptical. If this is in one rigid frame and you try to
do something here --
>>: [inaudible] the whole thing but [inaudible] say this is a new loop so I'm going to
really pay attention to pulling that together.
>> Daniel Martinec: Yeah, I know, but --
>>: [inaudible]
>> Rick Szeliski: Your triangle example doesn't have a loop in it. But if you actually
drew a loop, would you run into the same problem?
>> Daniel Martinec: Oh, yeah, it has many loops. I have many loops in my data.
>> Rick Szeliski: So as you're going along and you're kind of saying, well, I'm just going
to keep adding things in, at some point you might close the loop because you're starting
with something. After you've run this algorithm, when you want to pull out a final
consistent geometry, right, a consistent set of camera matrices, you say start with a triangle
and kind of walk out and don't rebundle unless your reprojection error gets too big.
>> Daniel Martinec: Yes, exactly.
>> Rick Szeliski: But you're going to keep doing that. You didn't tell us what happens
when the two ends meet.
>> Daniel Martinec: Yeah. When I'm doing this -- so let's say I make this into one
consistent frame, one rigid frame, but I keep the connections with the rest of the world,
which is still pair-wise. So even if I add something and maybe the reprojection errors
grow to ten pixels here, I bundle again with everything else. And this repairs it a bit.
So even if the loop should close somewhere else, I get there. Because, yeah, I'm growing
here, but this shouldn't be here, it should be here.
>>: [inaudible] final triangle.
>> Daniel Martinec: Hmm?
>>: Right? So you're guaranteeing that everything is globally consistent in the part that
you're growing out ->> Daniel Martinec: Yes.
>>: -- and that all of the relative orientations locally are consistent.
>> Daniel Martinec: Yes. Even --
>>: But at some point when you had that last triangle that --
>> Daniel Martinec: No.
>>: -- that generates the cycle --
>> Daniel Martinec: Okay. I can --
>>: -- you're going to have to reconcile it, right?
>> Daniel Martinec: This is difficult to draw. So I have something --
>>: There's nothing that guarantees that when you add [inaudible] --
>> Daniel Martinec: No, there is, there is --
>>: -- [inaudible] that they are absolutely consistent [inaudible].
>> Daniel Martinec: So let's say I grew to some point like this, and still here I have
some triangles which are not in that big reference frame. So I continue growing. And
now it turns out it should be like this -- well, this camera should not be here but rather here.
But because it's connected to the pair-wise reconstructions, the rest has to make some
changes.
>>: I think I get what he's saying, which is that as you freeze certain cameras to make
them into absolute global coordinates, you still keep reoptimizing the other pair-wise
[inaudible], so if there is a loop closure, the things that aren't frozen yet are going to keep
reoptimizing themselves.
So it's like, you know, when you pull a tire onto a rim when you're fixing a bicycle
tire, the other parts that aren't quite over the rim yet have to start moving so the
whole thing fits. So I'm guessing that maybe what Daniel is saying is that even if
you're putting final numbers on certain camera poses because you want those global
numbers, the ones that are still not finalized participate in the optimization, so they're
pulling you towards sort of a closure.
>>: Yeah, but it's not exactly like a bike tire, because it's just these relative -- so you're
only imposing relative constraints locally on the part that's not on the rim.
>>: Yeah.
>>: Everything can move [inaudible].
[multiple people speaking at once]
>>: [inaudible] remaining error.
>>: Nothing that guarantees that these things sum up to some absolute correct path.
>>: Yeah. I mean, the things that are frozen, if you're doing a freezing strategy, they're
frozen, you'll never get to reoptimize the base of the [inaudible]; you have to hope that the
pair-wise stuff distributes the error [inaudible] --
>>: No, you're reoptimizing -- you're not freezing the cameras.
>>: But you rebundle sometimes.
>>: You're just converting them from relative transformations to absolute [inaudible] --
>> Daniel Martinec: Yes.
>>: -- and rebundling them all.
>>: You have some subset that is absolute and so they're globally consistent to each
other in a reprojection error, and then you have a remaining subset that is relative and
they're all chained together locally. But if you traverse this chain, there's no guarantee
that the path between the two has to be -- is going to achieve the same transformation as
if you go globally around the [inaudible].
>> Daniel Martinec: Hey, look, I tried to draw it here. I started -- my data was somewhere
here --
>>: You're chaining together the path, you're not going to end up at the same position.
>> Daniel Martinec: I chain it triangle by triangle, let's say. Well, I do it by more triangles
to be fast, so I chain this triangle to this, and this is magenta here, this changed a bit. And
then I'm going here and this changed a bit too. So I'm enforcing that it has to be changed,
that there's nothing inconsistent --
>>: The global ones are being rebundled and the remaining ones which are local are also
being rebundled as part of that, right?
>>: Right. But the path through the local does not give you the same -- if you were to
[inaudible] the camera position by chaining together the --
>> Daniel Martinec: Yes.
>>: -- relative ones, it's not going to give you the same camera position --
>> Daniel Martinec: Of course not. But [inaudible] two-pixel reprojection error. And if
you chain something which is connected within two-pixel reprojection error, of
course, if the chain has 100 elements, then you can really fly away. But if there are just a
few of them, you cannot get really far.
And as I continue, I have fewer and fewer chains, so at the end I have something
which is entirely consistent, and between these two, over one element, it's consistent
within two pixels only, so I cannot really be [inaudible].
>>: So another option would be to -- you don't have to grow it out from one spot, you
would kind of take subregions and grow them all simultaneously --
>> Daniel Martinec: Yeah. This is what I tried. I tried to grow all triangles at the same
time. So I've got a lot of big components which are entirely overlapping each other, and
sometimes some components swallow some other components, so, yes, it
works too. Well, but it's still too much computation. I really didn't see any advantage of
this.
>>: I was just interested, you said -- you mentioned speedup. Like how many
[inaudible] this gives over other methods --
>> Daniel Martinec: Oh, I think --
>>: -- [inaudible]?
>> Daniel Martinec: That speedup was for second-order cone programming, which is not
used here. This is just standard bundle adjustment with a few extra terms, either algebraic
or real reprojection errors.
>> Rick Szeliski: Bundle adjustment using relative poses instead of the absolute poses, right?
>> Daniel Martinec: Yes, yes.
>> Rick Szeliski: Okay.
>> Daniel Martinec: And then a mixture of both at the final stage.
>>: [inaudible] done incorrectly multiple [inaudible] that happen that linking constraints
between [inaudible] start pulling at that [inaudible] system until much later.
>> Daniel Martinec: You're not proposing that I could --
>>: There are multiple changes [inaudible], right, and then you link the gauges
[inaudible] relative?
>> Daniel Martinec: I don't exactly understand the gauges stuff. In my case --
>>: [inaudible]
>> Daniel Martinec: I have each --
>>: [inaudible] gauges, if you want to think of it that way.
>> Daniel Martinec: Each IJ partial reconstruction is represented by a zero rotation and
translation for the first camera and by the relative rotation and translation of this image pair.
So this is my -- these are all my parameters which I have. And, of course, I have some
points associated with those few points. And this is my reprojection error. I call it this
way. So this is my data term. And this is already consistent, at the beginning it's below
two pixels, and as it has large weight relative to the penalty function, it stays. It stays
consistent.
>>: [inaudible]
>> Daniel Martinec: Okay. Thanks.
>> Rick Szeliski: [inaudible] that's right. I guess it means adjusting all the global
parameters. But it's a nonlinear optimization over an overcomplete set of parameters
which is the local poses, the relative poses.
And what would you say is the biggest advantage of this over running a full bundle
over -- you know, if you had the same -- in other words, you're solving an optimization
with a larger number of parameters, you could also be solving it in the global parameters
right from the beginning. What's the advantage of relaxing on local parameters?
>> Daniel Martinec: It is exactly the same argument as here. Because once it's in a
global frame, there are fewer parameters.
>> Rick Szeliski: Right.
>> Daniel Martinec: You cannot change much without introducing large errors on some
of your data.
>> Rick Szeliski: So everything is kind of stiffer in some way?
>> Daniel Martinec: Yes.
>>: Well, it's looser.
[multiple people speaking at once]
>> Rick Szeliski: Sorry. The bundle is -- the [inaudible] is stiffer and this one is looser.
>> Daniel Martinec: Yes.
>> Rick Szeliski: The new one is looser.
>>: Yeah, well, so the part that is in absolute coordinates, that's exactly like our standard
bundle; all measurements are the same and it looks just like a standard bundle.
>> Rick Szeliski: Right.
>>: But then there are all these other measurements which don't have to satisfy exactly
the constraint of chaining together into an absolute coordinate system.
>> Rick Szeliski: Right.
>>: So in that sense you can kind of softly --
>> Rick Szeliski: That's softer, so maybe it converges faster than --
[multiple people speaking at once]
>> Rick Szeliski: You have to kind of knock all the points and all the cameras together,
right, to move --
>>: I don't fully buy the argument [inaudible] for discussion.
>> Rick Szeliski: Okay. Yeah. All right. Any other questions? Okay. Thanks a lot,
then.
[applause]
>> Daniel Martinec: Thank you.