>> Rick Szeliski: Okay. Good afternoon everyone. I see some more Live Log folks
drifting in. So it is my real pleasure to welcome Daniel Martinec here to Microsoft
Research to talk to us about his thesis research. Daniel just finished his Ph.D. at the
Czech Technical University in Prague, which is where the ECCV conference was in --
what was it, 2004? Right. And that's where I started taking some of the first photos that
became Photosynth later.
Daniel was also -- while he was a student -- a participant on one of the teams
that was working on one of these 3D reconstruction challenges that I had run for the
ICCV conference. And most recently, since his graduation in the summer, he started
working for Microsoft. He's going to be joining the 3D team in Boulder at Vexcel
working on commercial versions of Photosynth.
>> Daniel Martinec: Thank you. So what you can see here is a result of the automatic
pipeline which we developed in Prague.
So this presentation is my Ph.D. [inaudible]. My thesis was led by Tomas Pajdla and
Mirko Navara at the beginning. All of this work was done at the Center for
Machine Perception at the Czech Technical University in Prague.
I will give a brief introduction to 3D reconstruction from multiple views, and I'll explain
why it is challenging. I'll show some work of others and how it differs from our
approach. After I formalize the problem, I will explain a few contributions of the thesis,
some of them in more detail, and at the end I would like to tell you about my nice new
results.
So we have some images as input, which can be entirely unorganized. And the task is
to calibrate the cameras and reconstruct what we can see in the images. Under general
assumptions we don't know the positions and orientations of the cameras, but once they
are computed, it is quite easy to obtain reconstructions like this, if they are computed
accurately enough.
And, moreover, we don't know the focal lengths, but we assume that the other internal
camera parameters are known: the principal point is in the image center and we have
square pixels. These kinds of images can be found, for example, on Flickr.
This task is challenging because there can be many images, like thousands or hundreds of
thousands. This example has more than 2,000 images. There can be many
occlusions, almost 100 percent, like in this case. There can be strong perspective effects,
so a simple affine camera model cannot be used.
There can be various configurations of the cameras, like narrow or wide baseline images,
and some images can be panoramas. Images can be either entirely unorganized or they
can be in some sequences.
And there is another challenge, which is caused by repetitive structures and similar
objects in the images. This means there can be many mismatches which nevertheless
satisfy the two-view geometry. So, for example, these four points happen to be quite
similar to these four points, but they just accidentally lie on the corresponding epipolar
lines. These are actually different buildings.
And the situation can be even worse. In this image pair there are something like 1,000
inliers to some epipolar geometry, but there is no true match, because these are different
buildings. So this building lies on this side in this image. And of course, to obtain a
consistent reconstruction of the whole environment, such wrong geometries have to be
rejected.
So the whole pipeline starts with some images. We detect some regions of interest and
match the similar ones. Using RANSAC, we obtain some two-view geometry,
which can look in 3D like this, and then we try to combine somehow such pair-wise
reconstructions into a multiview reconstruction.
Once that's done, it's possible to identify some image pairs which are suitable for dense
stereo. This is how you [inaudible] with narrow baseline. So we can obtain quite
full disparity maps which correspond to point clouds in
space. And we simply merge these point clouds into a point cloud which can be
further approximated locally by [inaudible]. And here we show just the principal planes
of the [inaudible] covered with texture.
Here we have millions, tens of millions of points, and here we have just a few hundred.
So we call them fish scales. But one can go even further. This is the result of my master's
student [inaudible]. He even found surfaces and used many depth maps so that it's more
consistent and more complete.
My studies were mainly about this step from two-view geometries to a multiview
geometry, how to combine them. There have been a number of research groups which have
dealt with this problem. On video sequences there were projects, for example, at the
University of Grants in Bexsell [phonetic]. A realtime 3D reconstruction was introduced
two years ago. For unorganized images there were also some universities working on
this theme, like Oxford and the lab in Washington.
And there are two ways to approach the problem of unorganized images. First is the
sequential one; however, it is quite prone to local minima. The other is factorization,
which is, on the other hand, not robust to occlusions. I will speak about this later. And
there is another approach, which I call the batch approach, introduced by Guilbert, but it's
usable for affine cameras only.
So what we did, we wanted to develop some methods for automatic 3D
reconstruction which are general and can be applied in the same way to both
organized and unorganized images. We try to use all the data at the same time. We
started our studies with factorization; however, as I said, there's a problem with
occlusions, so it's applicable to a few images only.
Jacobs generalized factorization so that it can be used for occlusions too, but, I'll speak on
this later, there are some problems with that. And our methods, most of our methods
which we developed, we call gluing of partial reconstructions. And this is not only
robust to occlusions but also to mismatches, as you will see.
So we assume that we have some measurements in images. Here we use homogeneous
coordinates because in the same way you can formalize the problem even for
omnidirectional cameras. So we want to compute some point in space, again in
homogeneous coordinates, and a camera, so that the point is projected by the camera onto
the image measurement up to some unknown depth.
All the blue things here are unknown; the only thing known is the image
measurement. So it's possible to stick all such projection equations into one larger
equation, where it's easy to obtain these [inaudible] projective depths from epipolar
geometry. If this matrix is not complete, these crosses stand for missing elements, which
are mainly due to occlusions.
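To make the stacked equation concrete, it has roughly this standard form (my notation, not necessarily the slide's):

\[
\lambda_{ij}\,\mathbf{x}_{ij} = P_i X_j,
\qquad
\underbrace{\begin{bmatrix}
\lambda_{11}\mathbf{x}_{11} & \cdots & \lambda_{1n}\mathbf{x}_{1n}\\
\vdots & & \vdots\\
\lambda_{m1}\mathbf{x}_{m1} & \cdots & \lambda_{mn}\mathbf{x}_{mn}
\end{bmatrix}}_{\text{rescaled measurement matrix, } 3m \times n}
=
\begin{bmatrix} P_1\\ \vdots\\ P_m \end{bmatrix}
\begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}
\]

where \(\mathbf{x}_{ij}\) is the homogeneous image measurement of point \(j\) in camera \(i\), \(\lambda_{ij}\) its projective depth, \(P_i\) a \(3\times 4\) camera, and \(X_j\) a homogeneous 4-vector, so the complete left-hand matrix has rank at most 4.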
So if this matrix is complete, it's very easy to factorize it into a product of cameras and
points using singular value decomposition. This is the result of Tomasi and Kanade and
many other people who use this approach.
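As a rough illustration of that step, here is a minimal sketch of rank-4 factorization of a complete, already-rescaled measurement matrix (variable names are mine, not from the talk):

```python
import numpy as np

def factorize_complete(W):
    """Factorize a complete rescaled measurement matrix W (3m x n) into
    stacked cameras P (3m x 4) and homogeneous points X (4 x n),
    in the Tomasi-Kanade style via a rank-4 SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    S4 = np.diag(np.sqrt(s[:4]))   # best rank-4 approximation, split between factors
    P = U[:, :4] @ S4              # stacked camera matrices
    X = S4 @ Vt[:4, :]             # homogeneous 3D points
    return P, X                    # both only up to a common 4x4 projective transform
```

The recovered cameras and points are defined only up to a 4-by-4 projective ambiguity, which is removed later by the metric upgrade.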
However, if there is an occlusion, you cannot do this. So people tried to identify some
large submatrix of this big matrix, do the factorization, and then extend the solution to
other parts of the matrix. This approach is called imputation, because you are
filling some holes in this matrix.
But the problem is that by filling, you introduce some error into this matrix. And this
error comes from just a few measurements; it's very local. And, moreover, when
you are extending this solution, you are using this error again and again. So it's
highly dependent on where you start. And so for difficult problems with many, many
occlusions, this really doesn't work.
There is another way: people tried to initialize these two matrices at random and then use,
for example, power factorization or [inaudible] to alternate between filling and
factorizing again. However, it doesn't really work for difficult data.
So what we do instead, we just look into our pair-wise reconstructions, or into
the cameras. Usually we know the internal parameters, or we can estimate them based on
pair-wise image measurements. And so it's possible to decompose each camera into a
product of this internal parameter matrix, a rotation, and a translation. And I will come
back to this later.
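In symbols, this is the standard decomposition (nothing specific to the talk); with the principal point at the image center and square pixels, only the focal length \(f_i\) is unknown in the calibration matrix:

\[
P_i \;\simeq\; K_i \,[\, R_i \mid t_i \,],
\qquad
K_i = \begin{bmatrix} f_i & 0 & 0\\ 0 & f_i & 0\\ 0 & 0 & 1 \end{bmatrix}.
\]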
So we started our research with projective factorization. We combined perspective
cameras with the handling of occlusions by Jacobs. Later we extended it to omnidirectional
cameras. And then I was playing for a while with the famous Dinosaur sequence, which has
only 36 images. And it was really a problem; I couldn't do it using Jacobs' method. The
problem with that method -- it's a very nice [inaudible] way -- is that Jacobs just takes
small sets of columns of the measurement matrix, and when there are missing elements,
some big subspaces are generated by these samples, and he intersects these subspaces.
However, he does it in a way that he goes to the complementary subspaces, makes a
union there, and goes back. This is the simple De Morgan rule. However, by going forth
and back, he loses the connection with the image measurements. And when I found this,
the problem was almost solved, because it was then sufficient to reformulate the problem
in the original subspaces. This is what we did, and it works amazingly well even when
everything is only projective, without camera calibration.
Then I studied further and came up with a method which doesn't even need these projective
depths. Then there was the ICCV contest, and there were panoramas, so I made some
detection of these panoramas. And later, at ICCV '05, Fredrik Kahl's method appeared,
which can just take absolute rotations, which I could already compute -- I will show you
how later -- and it casts a minimization problem using second-order cone programming, so
the translations and points can be estimated almost for free. And then I spent some time
on making things more robust, which you will also see.
So this is what I personally consider my biggest result, and I will spend a
few slides on these two remarkable results.
So, as I said, we start with some pair-wise reconstructions. To make things simple, we
forget about image data for now and keep just the camera viewing directions. And,
further, we forget about translation for a while, so we have just relative rotations.
And we want to register such relative rotations into one coordinate frame, or reference
frame. And how do we do it? Each such pair of viewing directions has to be somehow
rotated so that it aligns well in the reference coordinate frame.
And this can be very easily described using 3-by-3 rotation matrices, so you can see
what's actually happening there. And we end up with such equations, which are linear.
This is a great thing. There are a lot of unknowns. The matrices on the right
side I call consistent rotations. These are the absolute rotations; this is what we want to
estimate. And these matrices are mappings between the local coordinate frame of the
pair-wise reconstruction and the global coordinate frame.
So it has a nice geometrical meaning. And, moreover, it's very easy to solve. We used
just an eigenvalue solver. The problem has a global minimum, which is nice. The problem
is very sparse and very well conditioned, because all the matrices there are orthonormal
matrices -- the nicest matrices. And it's also very fast: a fraction of a second, even if we
have thousands of image pairs.
And, of course, the resulting matrices are not exactly rotations, but they are very
close to them. So it's sufficient to project these matrices to the space of [inaudible]
matrices, or to use SVD and just set all the singular values to one.
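A minimal sketch of this rotation registration as I understand it (the function and variable names are mine; it stacks the linear constraints R_j - R_ij R_i = 0, takes the three eigenvectors with the smallest eigenvalues, and projects each 3-by-3 block back to a rotation by setting its singular values to one):

```python
import numpy as np

def register_rotations(pair_rotations, n_cams):
    """pair_rotations: dict {(i, j): R_ij} with the convention R_j ~ R_ij @ R_i.
    Returns absolute rotations R_0..R_{n-1}, up to one global rotation (gauge)."""
    rows = []
    for (i, j), R_ij in pair_rotations.items():
        row = np.zeros((3, 3 * n_cams))      # one block row per pair: R_j - R_ij R_i = 0
        row[:, 3 * j:3 * j + 3] = np.eye(3)
        row[:, 3 * i:3 * i + 3] = -R_ij
        rows.append(row)
    A = np.vstack(rows)                      # very sparse in practice
    # The stacked columns of the absolute rotations lie near the null space of A,
    # so take the eigenvectors of A^T A with the three smallest eigenvalues.
    _, V = np.linalg.eigh(A.T @ A)
    C = V[:, :3]                             # (3 * n_cams) x 3
    Rs = []
    for i in range(n_cams):
        M = C[3 * i:3 * i + 3, :]            # approximate rotation, not exactly orthonormal
        U, _, Vt = np.linalg.svd(M)
        R = U @ Vt                           # project: set all singular values to one
        if np.linalg.det(R) < 0:             # keep proper rotations (det = +1)
            R = -R
        Rs.append(R)
    return Rs
```

For large problems one would use a sparse matrix and a sparse eigensolver instead of the dense calls above; the structure of the computation stays the same.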
So this is the result of registering the rotations into one coordinate frame. You can see that,
for example, this camera number 22 is there several times, because the translations
have not been registered yet. But after calling Kahl's method, you can see that the problem
is almost solved.
Here we have reprojection errors. The minimal error is below one pixel and the
maximum is below 20 pixels. You may say it's quite a lot; I will arrive at a new
method which doesn't have such a problem. But still, for a linear method, it's quite a good
result and it's very fast. Rotations are done in a fraction of a second and translations in a
few seconds.
>> Rick Szeliski: Daniel, if I remember, some of the techniques -- I don't remember if
David Nister is one of the people who worked on this -- when you know the rotations, there
are solutions for the translations, but points far away, points near infinity, often bias things
and make them not work very well. Is that correct, points at infinity have problems?
Or is it only if you try to do sort of a linear version of the problem?
>> Daniel Martinec: No. This is second-order cone programming.
>> Rick Szeliski: Second-order cone programming takes care of that because --
>> Daniel Martinec: Yeah.
>> Rick Szeliski: [inaudible]
>>: Okay. Okay.
>> Daniel Martinec: There's no -- I have met no problem with this method, except that it
relies on point measurements.
>> Rick Szeliski: Okay. So it doesn't matter that you're also seeing points at infinity
[inaudible].
>> Daniel Martinec: Yeah. And this is the result after the final bundle adjustment. And you
can see that nothing really changed, which shows that here we were very close to
the local minimum. Of course, we don't guarantee a global minimum -- the problem is
nonconvex -- but our reprojection error now is below two pixels,
which is, I think, a nice solution.
And also you could see that there is a big point cloud which moved, and it's due to points
visible in these two images, which are in fact a panorama. It's just a closeup. And it
didn't really matter to the method whether they were panoramas or not. It also
doesn't matter if the points are on some dominant planes. This is really robust to all
these things.
This is another example of almost 300 images from a mountain scene. This is another
shot. This is an example of the two images. So a few words on mismatch identification.
As I said, in this image pair, RANSAC on epipolar geometry identifies these mismatches
as inliers. How to --
>> Rick Szeliski: Can I ask a question about the previous approach? So you assume that
you have enough rotations estimated pair-wise that you can figure out a globally
consistent set of rotations. When you have very wide field of view cameras, you can
often reliably estimate rotations. As your field of view gets narrower, isn't there sort of
ambiguity between rotation and the translation that makes it hard to get good pair-wise
estimates?
>> Daniel Martinec: Yes. So, for example, these two cameras --
>> Rick Szeliski: Right.
>> Daniel Martinec: -- translation, in fact, is not determined there at all.
>> Rick Szeliski: Right. Those two cameras, because they were taken from the same
point of view. I'm saying, basically -- consider taking out your 35 millimeter
camera or your point-and-shoot, and the way all these cameras power up is they're in
wide-angle mode. If you force yourself to shoot the world with a 100 millimeter lens,
would the pair-wise rotation estimates just basically be poorly conditioned? Or if you
were flying in an airplane shooting aerial photos. In other words, if the
images are less perspective, can you estimate rotations accurately pair-wise?
>> Daniel Martinec: Well, I have not worked with such a camera.
>> Rick Szeliski: Okay.
>> Daniel Martinec: Yet. A strong assumption of this approach is that these [inaudible]
orientations are quite well estimated. But it works somehow even if there is an error of 100
degrees, which I have seen on the mountain scene sequence.
>> Rick Szeliski: Okay. So maybe because you're doing a [inaudible] least squares,
those errors [inaudible].
>> Daniel Martinec: Yes. It's a big least squares, and each of the terms has almost equal
weight. So even if one is really off, there are enough others, and they can make it
better.
>> Rick Szeliski: Okay.
>> Daniel Martinec: But -- yeah.
So I've made an observation that if there is a mismatch, it's usually far from the other --
sorry, far from the other point correspondences. And that is both in the image and
in depth. So I made a very simple heuristic: I took all my measurements in the image
multiplied by their depths, fitted a Gaussian to them, and removed the
25 percent of the points which are furthest from the Gaussian
center. And it turned out that all such bad guys disappeared this way.
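A rough sketch of this heuristic as I read it (assuming the measurements are image coordinates scaled by their normalized projective depths; the names and the exact distance measure are my choices):

```python
import numpy as np

def prune_by_gaussian(points_2d, depths, keep_fraction=0.75):
    """points_2d: (N, 2) image measurements; depths: (N,) projective depths.
    Fit one Gaussian to the depth-weighted measurements and drop the points
    farthest from its center (the suspected mismatches)."""
    d = depths / np.mean(depths)              # normalize so the mean depth is 1
    X = points_2d * d[:, None]                # depth-weighted measurements
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-9 * np.eye(2)
    diff = X - mu
    dist = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)  # Mahalanobis
    order = np.argsort(dist)                  # closest to the Gaussian center first
    keep = order[: int(keep_fraction * len(X))]
    return np.sort(keep)                      # indices of surviving correspondences
```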
Well, there are more comments on this in the paper and in the thesis. I know it's a heuristic,
but on the other hand, if you have only two such images, it is
really difficult to find out which points are mismatches and which are not, if you don't
have any other information [inaudible].
>> Rick Szeliski: But if you're weighting points by depth -- these scenes happen to be
closely cropped, so you don't see a lot of the background, right, but if you had
shots where you saw the mountains in the distance and things like
that, and there was a dominant object, then, you know, things far away, if you really are
weighting them by depth, will really --
>> Daniel Martinec: Yeah, yeah. Of course, you have to normalize all your depths so
that the mean depth is one or something like that. It's kind of a projective approach. It
works even if your cameras are not calibrated. So it's very general. It works for
omnidirectional cameras. And especially on that mountain scene with those
kilometer-distance objects it works.
>> Rick Szeliski: Okay. So it's more in a projective framework, basically if you take the
homogenous vectors, turn them into one norm, then it's meaningful to talk about the
center of mass.
>> Daniel Martinec: Yes, exactly. Yes. Of course, this way it happens that I remove
some good data too. But I don't mind -- I have enough data, and 75 percent of the data is
still there. And this image pair was the worst case: it had the largest amount of
mismatches, 25 percent. I haven't met any other data like that.
So these are the inliers which survived this test. And by removing these mismatches, our
reprojection error after translation estimation went down from 100 pixels to 22.
But we can do even better. It's possible to pick out only four points among these
inlier estimates so that they represent the [inaudible] geometry almost as well as all these
points.
And this is very simple, because the camera matrix has just four columns, which
means that when it projects the points, only a subspace of dimension 4 is generated. So
the only thing you need is four linearly independent columns. Well, and of course the
question is how to pick these four different, or independent, points. We use the same
technique as for identifying mismatches: we fit a Gaussian, pick the point which
is most different from the others, most distant from the center of mass, and repeat this
three more times, on a 3-dimensional subspace and a 2-dimensional subspace, because once
we have one point, we can keep just the remaining data which is not explained by that
dimension.
And by this we get a speedup of a factor of 2,000. Well, so instead of spending four hours on
translation estimation using those tens of thousands of points, we use just a few hundred
points, and it takes a few seconds with really similar results, within half a
pixel or so.
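A sketch of that greedy selection as I read it (pick the point farthest from the center of mass, remove the direction it explains, repeat; all names are mine):

```python
import numpy as np

def pick_representative_points(X, k=4):
    """X: (N, d) point coordinates of one pair-wise reconstruction, d >= k.
    Greedily pick k roughly independent points, farthest-from-center first."""
    idx = []
    R = X - X.mean(axis=0)                   # work relative to the center of mass
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))            # most distant in the remaining subspace
        idx.append(i)
        v = R[i] / (norms[i] + 1e-12)
        R = R - np.outer(R @ v, v)           # drop the direction already explained
    return idx
```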
>>: Do you think this is for applying Kahl's technique, right --
>> Daniel Martinec: Yes.
>>: -- for the translation? But what if the points you pick aren't present in other images?
When you need to find points that are also there -- or that's right. Kahl's technique is only
pair-wise, right?
>> Daniel Martinec: Yes. Everything so far was just pair-wise, which means
that you can have really bad data. It works even if there is no point visible in
three images.
>>: [inaudible] second-order cone program is slow?
>> Daniel Martinec: Yes. Yes. Exactly. It's very slow.
>> Rick Szeliski: There's something there. [inaudible] says that you need three-way
overlaps for reconstruction to be consistent, right? Or is that -- there's enough sort of
things looking at each other from different views that there's only one sort of consistent
solution, right? If you walk down a street and you never see any point in more than two
images, it's going to be hard to figure out exactly how fast you're walking.
>> Daniel Martinec: Yes. Of course.
>> Rick Szeliski: But if we're all looking at the same Dinosaur, then you don't really
necessarily need to see more than three images.
>> Daniel Martinec: Yes. If your camera moves along a straight path and you are using
only pair-wise data, then of course you cannot estimate the scales.
>> Rick Szeliski: Right.
>> Daniel Martinec: This never happened in my data sets. And of course you can plug
the points in three images, and I use them in a number of images in the final bundle
adjustment.
But it's better to use them as soon as possible, as early as possible in the optimization.
And in my new stuff, they are there from the very beginning.
So this is the scene which you've seen at the beginning. And here I can show you the
result of Photosynth. No, not this one. So it's split into three components, which doesn't
say anything about the multiview reconstruction; it says just that I have better matching so far.
So this is another example. There were several such epipolar geometries which are
wrong. You can see that this window is slightly rotated here, or it's entirely another
window; it just happened that all of these tiles on the roof look similar, so it's really difficult
to find out whether it's a true match or not.
But we used all the pair-wise reconstructions and we obtained a maximum error of more
than 100 pixels, and we used a simple technique: we identified the pair-wise reconstructions
in which the reprojection errors were the highest, removed these, and repeated that a
few times. We arrived at a reprojection error of seven pixels here, and you can see that
the reconstruction is really consistent. The surfaces visible by different cameras fluently
go one into another.
So this is the most difficult example. Here you can see that we kept removing epipolar
geometries, but then it started to oscillate at about 30 pixels. I think this is a really
difficult scene, and I think the method for identifying nonexistent epipolar geometries has
to be more sophisticated than simply relying on least squares being robust enough
if you have good conditioning.
So I've spent some time on this research, but then my attention was again given
to enhancing precision. So this is another example -- a scene with more than 2,000
images. It was a paper model made a few hundred years ago; it's a part of a model of
Prague with 6,000 buildings. And it was a very nice project; however, it was canceled, so
this is the only data which I have from it.
This linear technique worked on it the same way as on the other scenes, so I'm
pretty sure that it would work even on tens of thousands of images and maybe more. But
unfortunately I didn't have any larger data yet. And, moreover, I could not do
the dense reconstruction using my software, because I didn't want to spend time
rewriting it and the depth maps just didn't fit into 16 gigabytes of memory. And then the
project was canceled.
So I've shown some techniques on projective and metric gluing, which we developed,
which are quite accurate and relatively fast and robust. Of course, I had to write all the
software, or some of the software was from others. But, for example, I found some
heuristics for speeding up matching image pairs. I made some relative pose estimation
when the focal lengths are entirely unknown.
I worked also on line reconstruction. The software is part of a multicamera self-calibration
package which is widely used in the world. And I wonder that people still use it, because
my technique in that package is six years old and there was never a need to
replace it with newer, more robust techniques.
The software was sold to a Canadian company last year, and it's used not only by people
at our university but by another university too. It was well accepted by the vision
community. We ended up second at the ICCV '05 contest. We published quite a
few papers on it.
So this is everything from the stuff which I published in my thesis, but there
is some new stuff too. We've touched on some problems there already, and I
identified two problems. The first is that the rotation representation is not good in that
linear estimate, because what I get are only approximate rotations, not true rotations,
because rotations satisfy some nonlinear constraints. And thanks to that, some errors can
be inherited from some pair-wise reconstructions, and the translations can produce larger
errors, like 50 or 100 pixels, which may sometimes converge to a nice solution, but
sometimes not. And I wanted to make it more robust.
So my new approach has no approximation, no linear estimation of absolute rotations
all at once. It doesn't even use second-order cone programming. It simply takes the
pair-wise reconstructions which we have and tries to modify them slightly so they are
more consistent with each other. And when they are consistent, the problem is solved,
because then it's very, very simple to chain these consistent
reconstructions, which are consistent in rotation and translation and scale too.
And the solution has some very nice properties, because we have low reprojection errors
during the whole optimization process. Each of these pair-wise reconstructions has some
image points. I use four points, which you've seen, but you can use any number of points
if you want. And so on these pair-wise reconstructions the problem is solved. However,
it's inconsistent with the other reconstructions.
So the only thing which is needed is to add some penalty term which penalizes
constraints which are on top of these pair-wise reconstructions, and this is something
which has to be solved. So I think that by keeping these reconstructions consistent with
the data all the time, within a reprojection error of let's say 5 to 10 pixels, we can avoid
some really bad local minima.
So we get better accuracy. It's also scalable. I've tested it on 300 views only, but
as the old stuff worked on thousands, I may try it on
thousands or even more. And there are some things which are needed in the project I work
on now, Geosen's [phonetic]. They need some stuff like putting priors on the
cameras, and I think this is a really natural way to do it. We have these pair-wise
reconstructions, and if you have some idea of how the cameras are rotated or translated, I
think it is very easy to add them as data terms.
Well, so this is not finished, so I can show you some equations from my paper. So this is a
relative pose between views I and J. This is the relative rotation, and
this is the relative translation. Well, and this is the composed pose, which is just a
chain of such relative poses. So, for example, for these images, we just chain these
two and arrive at something like this. So this is very simple.
And I want my relative pose to be consistent, to be the same as some chain of relative
poses along some cycle, either a triangle or a cycle of larger length.
And so here I have a theorem that what I want is something which has to be
satisfied in the final reconstruction. And so the question is how to enforce it. Well,
it's quite simple to see that this rotation should be the same as the composed
rotation; the relative rotations are just multiplied. It's slightly more complicated
with translations.
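Written out, the cycle consistency being described looks roughly like this (my notation; the slides may differ). With the relative pose \((R_{ij}, t_{ij})\) mapping view \(i\) to view \(j\), composition over \(i \to j \to k\) gives

\[
R_{ik} \;\approx\; R_{jk}\,R_{ij},
\qquad
t_{ik} \;\approx\; s\,\bigl(R_{jk}\,t_{ij}\bigr) + t_{jk},
\]

and over a closed cycle the composed rotation should return to the identity, \(R_{ki} R_{jk} R_{ij} \approx I\). The per-pair scale \(s\) of the translations is what makes the translation part harder than the rotation part, as noted next.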
But if you want to minimize, in the penalty term, the difference
between the left- and the right-hand side of this equation, there is a problem. And it is the
scale in this translation, thanks to which these terms have different weights than those
terms.
So it turned out that it's much better to minimize the difference in essential matrices.
This is one way. Another way is to minimize the reprojection error using composed
camera matrices, which is the camera matrix of image J -- these are the internal
parameters -- and instead of the relative rotation I use the composed relative rotation.
So at the end we arrive at two formulations: one using these essential matrices and
one using the reprojection errors. So this first term is the standard bundle adjustment;
this is just the reprojection error. Well, the reprojection error is defined so that you throw
into it some image points, image correspondences, and some cameras. This is a pair of
cameras: one of the cameras is fixed and the other is represented by the relative rotation
and translation. And these are my points, either four or more, and this is the relative
rotation. These are just the squares of the errors, and the sum says that I sum over all
images in that pair-wise reconstruction, and over columns, or points. And I sum
over all pair-wise reconstructions. So this is simple. And this is the term which is the
[inaudible] function, which should be zero at the end. Of course, at the beginning it is
not, usually, because RANSAC doesn't know about the other data; it relies only on what is
in the two images. So this term is nonzero, and it turns out that it is sufficient to initialize
its weight at something like 0.001 -- this weight is
1,000 times smaller than 1.
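Schematically, the objective being described is something like this (my transcription, not the exact formula from the slides):

\[
\min \;
\sum_{(i,j)} \sum_{p}
\bigl\| \pi\bigl(K_j [R_{ij} \mid t_{ij}],\, X^{ij}_p\bigr) - x^{j}_p \bigr\|^2
\;+\;
\sum_{t} \omega_t \,
\bigl\| E(R_{ik}, t_{ik}) - E(R_{jk} R_{ij},\, R_{jk} t_{ij} + t_{jk}) \bigr\|^2,
\]

where the first term is the per-pair reprojection error (each pair-wise reconstruction keeps its own few points \(X^{ij}_p\)), \(E(R,t) = [t]_\times R\) builds an essential matrix from a relative pose, and the weights \(\omega_t\) start around \(10^{-3}\) and are raised gradually, in the spirit of a continuation method, as the next question brings up.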
>>: You're doing the continuation method where the right-hand term initially is small
[inaudible].
>> Daniel Martinec: Yes. Exactly. And I gradually want it to be more consistent. But if
I push too much, it's possible that this term gets higher reprojection errors, and this is
what I don't want. So I have to be quite careful about this.
And another possibility is to minimize exactly the reprojection error but on the composed
camera, which, instead of the relative rotation, has the composed rotation. So this may be
closer to -- and there is also some weight.
And that's all. The only problem is how to minimize this function. And it turns out that
it's possible to either have one weight, which is shared by all these triplets -- or however
long the cycles you want to have are -- and to be careful not to push too much. And
it works very well. I can show you some --
>> Rick Szeliski: Are you going to say a little bit later about having per-triangle
weights? Because there's an index there, right, so it suggests that you have a different
weight for each T.
>> Daniel Martinec: Yes, it can be different. It's up to you. I have it here with different
indices because -- because it's possible to solve this problem locally only. Imagine we
have some network of our relative -- of our -- well, these are -- each line is an epipolar
geometry.
>> Rick Szeliski: Right.
>> Daniel Martinec: And let's say in this triangle that penalty term is the highest.
>>: Okay.
>> Daniel Martinec: And it's possible to just take these few neighboring triangles, so this
is this one, this one, yeah, and that's it. And a few more triangles which neighbor these,
which are this one and this one, and to optimize only this part. These three epipolar
geometries, or these others, can be modified too, rotations and translations, but these are
kept fixed.
>> Rick Szeliski: Okay.
>> Daniel Martinec: Which means that we lower some residuals here --
>> Rick Szeliski: Right.
>> Daniel Martinec: -- but stay consistent with the rest of the world.
>> Rick Szeliski: Oh, so you use the omega IJK to basically freeze parts of the
reconstruction and do subsets.
>> Daniel Martinec: Yes.
>> Rick Szeliski: Okay. So -- well, why don't you finish your talk. I have a suggestion.
>> Daniel Martinec: Yeah, that's all. That's all. It's not finished. Also I couldn't finish
the slides --
>> Rick Szeliski: [inaudible] results.
>> Daniel Martinec: Sorry, some results, yes.
So I can show you the result of Photosynth again. So you can see that there is really
some -- oh. I don't know if you can see it. There is some problem that some of the walls
in this right-hand side are not perpendicular or parallel with each other. So in my
method, this is -- I would say this is correct.
>>: Your method produces nicer looking colors.
>> Daniel Martinec: Thank you. This is one tenth of all dense points generated by dense
stereo in our pipeline. And these dense points have the nice property that they can really
reveal the inconsistency, which I need, because I'm trying [inaudible] algorithms which
are highly, highly accurate.
And here you can see that this part of the roof is there twice, maybe more times. This is
special data which I want to work on, because there are not many images around. There
is another image set with this house, but it's from summer. That's why there are more
features on the ground and it's very nicely reconstructed on the Internet. But this one
is winter. And so on this scene I really realized that I have to use triplet-wise
correspondences, and I'm using them in this framework.
It's really easy to express triplet-wise reconstructions using relative rotations and
translations only. And it is really equivalent to having triplet-wise partial reconstructions.
This is something which we discussed in the last couple of days.
>>: So how important do you think it is to get the matches versus -- so [inaudible] my
experience that I've had, so I've looked at the summer dataset of that a lot, and if the
[inaudible] does not close the loop, clearly you have [inaudible]. So if you miss a match,
typically what's happening in some of the datasets is that it miss -- you know, we were
using SIFT style matches, and so it just won't -- it won't do that. You're using a
[inaudible] variance style of matching.
And so the general thing that we've seen is that if we get the matches, we typically don't
have too much problem of the loop opening. And so if the match were closed, that
would -- which we were able to do when we kind of raised the resolution when we're
running that, then the optimization tended to have no problem [inaudible].
>> Daniel Martinec: Yeah. Of course, if I don't have the loop closed, I cannot do
anything with it. And if I have the data and it's weakly conditioned -- in some
of my datasets I have just eight inliers, and I'm happy with them; I can make use of them
and I can really make it consistent on all of these points.
So, of course, the more images you have, the better. But without the loop
closures you cannot expect that this would be somehow better. This is just a standard
bundle adjustment, nothing more. Maybe it would [inaudible] local minima. But I
need the hidden data too.
>>: So in your work, can you force it so they both use the same matches?
>> Daniel Martinec: No, no. Yes, of course. That's another point. Yeah. The pipelines
are different. We have many nice features like [inaudible] and many types of MSERs,
and we use local affine frames and DCT coefficients for descriptors and SIFT features
too.
I found out on the mountain scene that the SIFT features are not really good, because
some of the pair-wise reconstructions have as low as 2 percent of inliers, because
everything is stone. Well, so MSERs with local affine frames work much better than
SIFTs. That's why for identifying reliable epipolar geometries I use MSERs only; in
[inaudible] I use SIFTs for identifying the data, which may help. But I don't rely on
SIFTs.
>>: Just to clarify, are you saying that you take multiple subsets of views corresponding
to multiple bundles, and that's what the cost function at 14 is all about, and then you add
consistency terms between them, and then you sort of pump up [inaudible] until --
>> Daniel Martinec: Yes.
>>: -- it goes to infinity and then they have to be consistent, you can just roll them out
[inaudible] consistent up to [inaudible].
>> Daniel Martinec: Yeah, I hoped originally that there could be just one weight, which
is shared by all the closures, all the loops. And I thought, okay, by just raising it by small
amounts I must arrive at something which is really consistent.
But unfortunately this is not true. I think it's because this is just least squares and it has
many, many local minima. So then I arrived at this local method which
just tries to repair the highest residuals. So I have a graph like this where each -- here I
have triangles. And at the beginning they all have large residuals, like this. And
so I identify the triangle which has the highest residual, like this one, and do something
on it, three steps [inaudible], so that the residual gets, I don't know, to 50 percent or 70
percent, or however much you want, depending on how fast you want to be. And together
with that some neighboring residuals go down too.
So now I've improved the consistency between the data. And now I identify again and
improve it. And at the end I have something like this. So first I start with these essential
closure constraints, because they are applicable even when your points are behind the
composed cameras. Because this can happen: the relative rotations
can have an error like a hundred degrees, they can be a hundred degrees away, and it's
very easy that some of the points are just behind the camera. That's why you cannot use
this formulation -- it's nonsense if your [inaudible] point is not in front of the
camera. That's why first I do a few steps using this part, and then I switch
to this, which really is the reprojection error. That's nothing else.
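In pseudocode, the local repair loop described here might look as follows (a sketch under my reading of the talk; `optimize_neighborhood` stands for the local bundle over one triangle and its neighbors and is not a function from the speaker's software):

```python
def relax_cycles(triangles, residual, optimize_neighborhood,
                 target_drop=0.5, tol=1e-6, max_iters=1000):
    """triangles: camera triplets forming cycles in the epipolar-geometry graph.
    residual(t): current closure penalty of triangle t.
    optimize_neighborhood(t, target): locally re-optimize the relative poses of t
    and its neighboring triangles while keeping the rest of the graph fixed."""
    for _ in range(max_iters):
        worst = max(triangles, key=residual)     # triangle with the highest residual
        r0 = residual(worst)
        if r0 < tol:                             # everything is (numerically) consistent
            break
        # Lower this residual to roughly target_drop of its value; neighboring
        # triangles may move too, but the rest of the graph is kept fixed, so it
        # stays consistent with what has already been achieved.
        optimize_neighborhood(worst, target=target_drop * r0)
    return triangles
```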
And at the end, of course, it's not entirely consistent due to image noise. So
I start with some triangle, make the 3D reconstruction of it, and add some other
triangles. And usually I don't have to bundle again; usually everything is below two
pixels. Only when it grows to, say, 5 to 10 pixels, usually not more -- only
sometimes when there are panoramas, and then the translations. So I don't know exactly.
So I can just grow the solution. And this is something like your seed growing -- how is it
called? Yeah. But now I have everything preregistered, very, very tightly, so --
>>: So once you've got the [inaudible] did this bias even grow [inaudible] consistency
[inaudible] groups that are created every time you add a view, would this do as well as
[inaudible].
>> Daniel Martinec: I've tried to do something similar. I've tried to grow like two
components which somehow overlap somewhere, or maybe more components, and tried
to enforce similar constraints like this on these shared cameras. Well, there may be other
constraints; I haven't tried too much of that. But it turned out that once you put this big
set of cameras into one consistent system, they are very, very tight and it's really
difficult to do anything with them. So it seems to me that it's much
better to have everything just pair-wise, make everything consistent and just grow, than
doing the growing here.
I don't know. I really had problems with this. It was just inconsistent and it
just didn't want to come together on these overlapping cameras. Which is the same case as
you mentioned -- sorry, no, this is a similar case.
>> Rick Szeliski: That's the loop closure.
>> Daniel Martinec: Yes. I'm really skeptical. If this is in one rigid frame and you try to
do something here --
>>: [inaudible] the whole thing but [inaudible] say this is a new loop so I'm going to
really pay attention to pulling that together.
>> Daniel Martinec: Yeah, I know, but --
>>: [inaudible]
>> Rick Szeliski: Your triangle example doesn't have a loop in it. But if you actually
drew a loop, would you run into the same problem?
>> Daniel Martinec: Oh, yeah, it has many loops. I have many loops in my data.
>> Rick Szeliski: So as you're going along and you're kind of saying, well, I'm just going
to keep adding things in, at some point you might close the loop because you're starting
with something. After you've run this algorithm, when you want to pull out a final
consistent geometry, right, a consistent set of camera matrices, you say start with a triangle
and kind of walk out and don't rebundle unless your reprojection error gets too big.
>> Daniel Martinec: Yes, exactly.
>> Rick Szeliski: But you're going to keep doing that. You didn't tell us what happens
when the two ends meet.
>> Daniel Martinec: Yeah. When I'm doing this -- so let's say I make this into one
consistent frame, one rigid frame, but I keep the connections with the rest of the world,
which is still pair-wise. So even if I add something and maybe the reprojection errors
grow to ten pixels here, I bundle again with everything else. And this repairs it a bit.
So even if the loop should close somewhere else, I get there. Because, yeah, I'm growing
here, but this shouldn't be here, it should be here.
>>: [inaudible] final triangle.
>> Daniel Martinec: Hmm?
>>: Right? So you're guaranteeing that everything is globally consistent in the part that
you're growing out ->> Daniel Martinec: Yes.
>>: -- and that all of the relative orientations locally are consistent.
>> Daniel Martinec: Yes. Even --
>>: But at some point when you had that last triangle that --
>> Daniel Martinec: No.
>>: -- that generates the cycle --
>> Daniel Martinec: Okay. I can --
>>: -- you're going to have to reconcile it, right?
>> Daniel Martinec: This is difficult to draw. So I have something --
>>: There's nothing that guarantees that when you add [inaudible] --
>> Daniel Martinec: No, there is, there is --
>>: -- [inaudible] that they are absolutely consistent [inaudible].
>> Daniel Martinec: So let's say I grew to some point like this, and still here I have
some triangles which are not in that big reference frame. So I continue growing. And
now it turns out it should be like this -- well, this camera should not be here but rather here.
But because it's connected to the pair-wise reconstructions, the rest has to make some
changes.
>>: I think I get what he's saying, which is that as you freeze certain cameras to make
them into absolute global coordinates, you still keep reoptimizing the other pair-wise
[inaudible], so if there is a loop closure, the things that aren't frozen yet are going to keep
reoptimizing themselves.
So it's like, you know, when you pull a tire onto a rim when you're fixing a bicycle
tire, the other parts that aren't quite over the rim yet have to start moving so the
whole thing fits. So I'm guessing that maybe what Daniel is saying is that even if
you're putting final numbers on certain camera poses because you want those global
numbers, the ones that are still not finalized participate in the optimization, so they're
pulling you towards sort of a closure.
>>: Yeah, but it's not exactly like a bike tire, because it's just these relative -- so you're
only imposing relative constraints locally on the part that's not on the rim.
>>: Yeah.
>>: Everything can move [inaudible].
[multiple people speaking at once]
>>: [inaudible] remaining error.
>>: Nothing that guarantees that these things sum up to some absolute correct path.
>>: Yeah. I mean, the things that are frozen, if you're doing a freezing strategy, they're
frozen, you'll never get to reoptimize the base of the [inaudible]; you have to hope that the
pair-wise stuff distributes the error [inaudible] --
>>: No, you're reoptimizing -- you're not freezing the cameras.
>>: But you rebundle sometimes.
>>: You're just converting them from relative transformations to absolute [inaudible] --
>> Daniel Martinec: Yes.
>>: -- and rebundling them all.
>>: You have some subset that is absolute and so they're globally consistent to each
other in a reprojection error, and then you have a remaining subset that is relative and
they're all chained together locally. But if you traverse this chain, there's no guarantee
that the path between the two has to be -- is going to achieve the same transformation as
if you go globally around the [inaudible].
>> Daniel Martinec: Hey, look, I tried to draw it here. I started -- my data was somewhere
here --
>>: You're chaining together the path, you're not going to end up at the same position.
>> Daniel Martinec: I chain it triangle by triangle, let's say. Well, I do it by more triangles
to be fast, so I chain this triangle to this, and this is magenta here, this changed a bit. And
then I'm going here and this changed a bit too. So I'm enforcing that it has to be changed,
that there's nothing inconsistent --
>>: The global ones are being rebundled and the remaining ones which are local are also
being rebundled as part of that, right?
>>: Right. But the path through the local does not give you the same -- if you were to
[inaudible] the camera position by chaining together the --
>> Daniel Martinec: Yes.
>>: -- relative ones, it's not going to give you the same camera position --
>> Daniel Martinec: Of course not. But [inaudible] two-pixel reprojection error. And if
you chain something which is connected within two-pixel reprojection error, of
course, if the chain has 100 elements, then you can really fly away. But if there are just a
few of them, you cannot get really far.
And as I continue, I have fewer and fewer chains, so at the end I have something
which is entirely consistent, and between these two, over one element, it's consistent
within two pixels only, so I cannot really be [inaudible].
>>: So another option would be to -- you don't have to grow it out from one spot, you
would kind of take subregions and grow them all simultaneously --
>> Daniel Martinec: Yeah. This is what I tried. I tried to grow all triangles at the same
time. So I've got a lot of big components which are entirely overlapping each other, and
sometimes some components swallow some other components, so, yes, it
works too. Well, but it's still too much computation. I really didn't see any advantage of
this.
>>: I was just interested, you said -- you mentioned speedup. Like how many
[inaudible] this gives over other methods --
>> Daniel Martinec: Oh, I think --
>>: -- [inaudible]?
>> Daniel Martinec: That speedup was for second-order cone programming, which is not
used here. This is just standard bundle adjustment with a few extra terms, either algebraic
or real reprojection errors.
>> Rick Szeliski: Bundle adjustment using relative poses instead of the absolute poses, right?
>> Daniel Martinec: Yes, yes.
>> Rick Szeliski: Okay.
>> Daniel Martinec: And then a mixture of both at the final stage.
>>: [inaudible] done incorrectly multiple [inaudible] that happen that linking constraints
between [inaudible] start pulling at that [inaudible] system until much later.
>> Daniel Martinec: You're not proposing that I could --
>>: There are multiple changes [inaudible], right, and then you link the gauges
[inaudible] relative?
>> Daniel Martinec: I don't exactly understand the gauges stuff. In my case --
>>: [inaudible]
>> Daniel Martinec: I have each --
>>: [inaudible] gauges, if you want to think of it that way.
>> Daniel Martinec: Each IJ partial reconstruction is represented by a zero rotation and
translation for the first camera and by the relative rotation and translation of this image pair.
So this is my -- these are all my parameters which I have. And, of course, I have some
points associated with those few points. And this is my reprojection error. I call it this
way. So this is my data term. And this is already consistent, at the beginning it's below
two pixels, and as it has large weight relative to the penalty function, it stays. It stays
consistent.
>>: [inaudible]
>> Daniel Martinec: Okay. Thanks.
>> Rick Szeliski: [inaudible] that's right. I guess it means adjusting all the global
parameters. But it's a nonlinear optimization over an overcomplete set of parameters
which is the local poses, the relative poses.
And what would you say is the biggest advantage of this over running a full bundle
over -- you know, if you had the same -- in other words, you're solving an optimization
with a larger number of parameters, you could also be solving it in the global parameters
right from the beginning. What's the advantage of relaxing on local parameters?
>> Daniel Martinec: It is exactly the same argument as here. Because once it's in a
global frame, there are fewer parameters.
>> Rick Szeliski: Right.
>> Daniel Martinec: You cannot change much without introducing large errors on some
of your data.
>> Rick Szeliski: So everything is kind of stiffer in some way?
>> Daniel Martinec: Yes.
>>: Well, it's looser.
[multiple people speaking at once]
>> Rick Szeliski: Sorry. The bundle is -- the [inaudible] is stiffer and this one is looser.
>> Daniel Martinec: Yes.
>> Rick Szeliski: The new one is looser.
>>: Yeah, well, so the part that is in absolute coordinates, that's exactly like our standard
bundle; all measurements are the same and it looks just like a standard bundle.
>> Rick Szeliski: Right.
>>: But then there are all these other measurements which don't have to satisfy exactly
the constraint of chaining together into an absolute coordinate system.
>> Rick Szeliski: Right.
>>: So in that sense you can kind of softly --
>> Rick Szeliski: That's softer, so maybe it converges faster than --
[multiple people speaking at once]
>> Rick Szeliski: You have to kind of knock all the points and all the cameras together,
right, to move --
>>: I don't fully buy the argument [inaudible] for discussion.
>> Rick Szeliski: Okay. Yeah. All right. Any other questions? Okay. Thanks a lot,
then.
[applause]
>> Daniel Martinec: Thank you.