CZECH TECHNICAL UNIVERSITY IN PRAGUE

DOCTORAL THESIS STATEMENT

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics

Michal Havlena

Incremental Structure from Motion for Large Ordered and Unordered Sets of Images

Ph.D. Programme: Electrical Engineering and Information Technology, P2612
Branch of study: Artificial Intelligence and Biocybernetics, 3902V035

Doctoral thesis statement for obtaining the academic title of "Doctor", abbreviated to "Ph.D."

Prague, June 2012

The doctoral thesis was produced during full-time Ph.D. study at the Center for Machine Perception of the Department of Cybernetics of the Faculty of Electrical Engineering of the CTU in Prague.

Candidate: Mgr. Michal Havlena
Department of Cybernetics
Faculty of Electrical Engineering of the CTU in Prague

Supervisor: Ing. Tomáš Pajdla, Ph.D.
Department of Cybernetics
Faculty of Electrical Engineering of the CTU in Prague

Opponents:
............................................................................................
............................................................................................
............................................................................................

The doctoral thesis statement was distributed on ................... .

The defence of the doctoral thesis will be held on .................... at ....... a.m./p.m. before the Board for the Defence of the Doctoral Thesis in the branch of study Artificial Intelligence and Biocybernetics in the meeting room No. ......... of the Faculty of Electrical Engineering of the CTU in Prague.

Those interested may get acquainted with the doctoral thesis concerned at the Dean's Office of the Faculty of Electrical Engineering of the CTU in Prague, at the Department for Science and Research, Technická 2, 166 27 Prague 6.

Prof. Ing. Vladimír Mařík, DrSc.
Chairman of the Board for the Defence of the Doctoral Thesis in the branch of study Artificial Intelligence and Biocybernetics
Department of Cybernetics, Karlovo náměstí 13, 121 35 Prague 2

Incremental Structure from Motion for Large Ordered and Unordered Sets of Images

Contents
1 Problem Formulation
2 Contributions
3 State of the Art
4 City Modeling from Google Street View
5 Omnidirectional Sequence Stabilization
6 Randomized Structure from Motion
7 Image Set Reduction and Prioritized SfM
8 Conclusions

Keywords: Structure from Motion, Omnidirectional Vision, City Modeling

Note: Full text of the thesis is available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/havlena/Havlena-TR-2012-13.pdf

1 Problem Formulation

In this thesis, we deal with the problem of modeling large, predominantly static scenes from ordered, i.e. sequential, and unordered image sets comprising thousands of images when no further a priori information about the captured scene is available. To solve such a task, we build upon the foundations of multiple view geometry [14], namely Structure from Motion. The goal of Structure from Motion (SfM) is to recover both the positions of the 3D points of the scene (structure) and the unknown poses of the cameras capturing the scene (motion, external camera calibration). Knowing the camera poses, one can use multi-view stereo 3D surface reconstruction methods [9, 17] to construct compelling 3D surface models of the scene, see Figure 1.

There is no closed-form solution to the SfM problem [30]. The problem is highly non-linear, and the search for the optimal model parameters by minimizing the reprojection error is likely to get stuck in a local minimum.
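For reference, the objective minimized in this non-linear refinement, commonly known as bundle adjustment [22], can be written as follows; the notation is ours and is only meant to make the term "reprojection error" concrete:

$$\min_{\{P_i\},\{X_j\}} \sum_{(i,j)\in\mathcal{V}} \big\| x_{ij} - \pi(P_i X_j) \big\|^2,$$

where $P_i$ are the camera matrices, $X_j$ the 3D points, $x_{ij}$ the observed projection of point $j$ in image $i$, $\pi$ the projection onto the image plane, and $\mathcal{V}$ the set of point-image pairs in which the point is visible.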
The methods described in this thesis fall into the group of "incremental" methods which use simple components such as epipolar geometry computation for a pair of images, 3D point triangulation, and camera resectioning to build the resulting model incrementally, starting from one or more seed reconstructions and connecting additional cameras until the whole scene is reconstructed [37], all interleaved with the necessary non-linear optimization [22]. The art of such methods resides in the design of a computational "pipeline" which combines the aforementioned components into a reconstruction procedure that is both robust and efficient for the input image data. As the size of the image set grows, the exhaustive pairwise image matching commonly used to reveal the structure of the data becomes infeasible.

Figure 1: (a) Sample images from the 316 images of Marseille Centre de la Vieille Charité. (b) External camera calibration and a sparse 3D point cloud obtained using method [38]. (c) 3D surface reconstruction computed by method [17].

2 Contributions

The contribution of the thesis is related to large scale Structure from Motion (SfM) from both ordered and unordered image sets. The research on sequential SfM was conducted in collaboration with my colleague, Akihiko Torii, and the research on SfM from unordered data was done by me. Specifically, my contributions are the following:

• Use of visual indexing for image pair/triplet selection. We avoid the most time-consuming step of large scale SfM from unordered image sets, the computation of all pairwise matches and geometries, by sampling pairs of images and estimating visual overlap using the detected occurrences of visual words. The evaluation of the similarity scores by computing scalar products of so-called tf-idf vectors [36] is also quadratic in the number of images in the set, but a scalar product is a much simpler operation than full feature matching, which leads to a significant speedup of SfM. Furthermore, we proposed to sample triplets of images instead of pairs for the seeds of the reconstruction because 3D points verified in three views are more likely to be correct. The constructed atomic 3D models are merged together to give the final large scale 3D model at later stages of the computation.

• Image set reduction by applying a graph algorithm (CDS). The idea of using visual indexing for SfM was further extended in order to efficiently reconstruct image sets with uneven image coverage, i.e. community image sets of cities with landmarks. A small subset of the input images is selected by computing, with a fast polynomial algorithm [13], the approximate minimum connected dominating set of a graph whose vertices are the images and whose edges connect visually similar images. This kind of reduction guarantees, to some extent, that the removed images have visual overlap with at least some images left in the set and can therefore be connected to the resulting 3D model later.

• Task ordering using a priority queue. We use task prioritization to avoid spending too much time on a few difficult matching problems instead of exploring other, easier options. Compared to our previous work, which had the computation split into several stages [15], the usage of a priority queue for interleaving different "atomic 3D model construction" and "image connection" tasks facilitates obtaining reasonable reconstructions in limited time. The priorities of the individual tasks are set according to image similarity and the history of the computation.

Joint contributions follow:

• Computation of dominant apical angle (DAA). When performing sequential SfM by chaining pairwise epipolar geometries [14], the reconstruction usually fails when the amount of translation between consecutive cameras is not sufficient. We demonstrate that the amount of translation can be reliably measured for general as well as planar scenes by the most frequent apical angle, the angle under which the camera centers are seen from the perspective of the reconstructed scene points. By selecting only image pairs which have a sufficient DAA, one is able to easily reconstruct even sequences with variable camera motion speed.

• Sequence bridging by visual indexing. We extend the known concept of loop closing, e.g. [16], which tries to correct the trajectory of the camera once the same place is re-visited, by searching for all the trajectory loops at once based on co-occurring visual words. Geometrically verified loop candidates are added to the model as new constraints for bundle adjustment, which closes the detected loops as it enforces global consistency of camera poses and 3D structure in the sequence.

• Image stabilization using non-central cylindrical image generation. A new technique for omnidirectional image rectification based on stereographic image projection was introduced as an alternative to central cylindrical image generation. We show that non-central cylindrical images are suitable for people recognition with classifiers trained on perspective data, e.g. [6], once the images are stabilized w.r.t. the ground plane.

• Using a cone test instead of the reprojection error. When verifying 2D-3D matches in a RANSAC loop [7], we do not rely on the widely used reprojection error but make use of the fact that the projections of the 3D point to all the related images are known and use a "cone test" instead (see the sketch at the end of this section). Two-pixel-wide pyramids are cast through the corresponding pixel locations in the related cameras, and an LP feasibility task is solved to decide whether the intersection of the "cones" is non-empty. This allows for accepting a correct match even if the currently estimated 3D point location is incorrect, without explicitly modeling the probability distribution of the location of the 3D point.

It is worth mentioning that the methods were implemented to work with the general central camera model, covering the most common special cases including (i) perspective cameras, (ii) fish-eye lenses, (iii) equirectangular panoramas, and (iv) cameras calibrated by the polynomial omnidirectional model.
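To make the cone test concrete, the following Python sketch decides cone intersection by LP feasibility for perspective cameras. It is a minimal illustration under our own simplifying assumptions (known calibration matrix K, world-to-camera rotation R, and camera center C per view), with function names of our choosing; it is not the thesis implementation.

```python
import numpy as np
from scipy.optimize import linprog

def cone_halfspaces(K, R, C, u, width=2.0):
    """Half-space description of the pyramid ("cone") cast from camera
    center C through a width x width pixel square centered at pixel u.
    Returns (A, b) such that the cone is {X : A @ X <= b}."""
    h = width / 2.0
    corners = [u + np.array(d) for d in [(-h, -h), (h, -h), (h, h), (-h, h)]]
    # Back-project the pixel corners to world-space viewing rays.
    rays = [R.T @ np.linalg.solve(K, np.array([c[0], c[1], 1.0])) for c in corners]
    center_ray = R.T @ np.linalg.solve(K, np.array([u[0], u[1], 1.0]))
    A, b = [], []
    for k in range(4):
        n = np.cross(rays[k], rays[(k + 1) % 4])  # face plane through C
        if n @ center_ray < 0:                    # orient the normal inwards
            n = -n
        A.append(-n)                              # inside: n.(X - C) >= 0
        b.append(-n @ C)
    return np.array(A), np.array(b)

def cones_intersect(cones):
    """cones: list of (K, R, C, u) tuples, one per related camera.
    Feasibility LP: does a 3D point X exist inside all pyramids?
    Zero objective, so we only test feasibility."""
    As, bs = zip(*[cone_halfspaces(*c) for c in cones])
    res = linprog(c=np.zeros(3), A_ub=np.vstack(As), b_ub=np.hstack(bs),
                  bounds=[(None, None)] * 3)
    return res.status == 0                        # 0 = feasible, 2 = infeasible
```

A 2D-3D match is then accepted if `cones_intersect` returns True for the cones cast from all cameras observing the point.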
3 State of the Art

The complexity of large scale SfM computation is quite different for ordered and unordered image sets, as image order gives a clue as to which pairs of images should have an overlapping field of view and are therefore suitable for processing. On the other hand, the methods for fast selection of promising image pairs can also be used for ordered image sets to improve the consistency of the resulting models via loop closing.

Ordered Image Set Processing

Short baseline SfM using simple image features [5], which performs real-time detection and matching, recovers camera poses and trajectory sufficiently well when all camera motions between consecutive frames in the sequence are small. On the other hand, wide baseline SfM based methods, which use richer features such as MSER [25], Laplacian-Affine, Hessian-Affine [27], SIFT [23], and SURF [2], are capable of producing feasible tentative matches under large changes of visual appearance between images induced by rapid changes of camera pose and illumination. Work [10] presented SfM based on wide baseline matching of SIFT features using a single omnidirectional camera and demonstrated the performance on indoor environments.

The state of the art technique for finding relative camera poses from image matches first establishes tentative matches by pairing image points with mutually similar features and then uses RANSAC [7, 14, 4] to look for a large subset of the set of tentative matches which is, within a predefined threshold ε, consistent with an epipolar geometry (EG) [14]. Unfortunately, this strategy does not always recover the epipolar geometry generated by the actual camera motion, as has been observed in [20, 29, 41]. Often, there are several models which are supported by a large number of matches. Then, the chance that the correct model, even if it has the largest support, will be found by running a single RANSAC is small. Work [20] suggested generating models by randomized sampling as in RANSAC but using soft (kernel) voting for a physical parameter instead of looking for the maximal support.

Work [26] demonstrated 3D modeling from perspective images exported from Google Street View images using piecewise planar structure constraints. Another recent related work [39] demonstrated the performance of SfM which employs guided matching by using epipolar geometries computed in previous frames, and robust camera trajectory estimation by computing camera orientations and positions individually for the calibrated perspective images acquired by the Point Grey Ladybug Spherical Digital Video Camera System [33].

In [18], loop closing capable of removing the drift error of sequential SfM is achieved by merging partial reconstructions of overlapping sequences which are extracted using an image similarity matrix [36, 19]. Work [34] finds loop endpoints by using the image similarity matrix and verifies the loops by computing the rotation transform between the pairs of origins and endpoints under the assumption that the positions of the origin and the endpoint of each loop coincide. Furthermore, they constrain the camera motions to a plane to reduce the number of parameters in bundle adjustment.
Unordered Image Set Processing

Most of the state of the art techniques for 3D reconstruction from unordered image sets [35, 3, 42, 24] start the computation by performing exhaustive pairwise image matching in order to reveal the structure and connectivity of the data. Bundler [37] uses exhaustive pairwise image feature matching and epipolar geometry computation to create an image graph which is later used to guide the reconstruction. By finding the skeletal set [38] of this graph, the reconstruction time improves significantly, but the time spent on image matching remains the same. A recent advancement of the aforementioned technique [1] abandons exhaustive pairwise image matching by using shared occurrences of visual words [31, 36] to match only the ten most promising images for each input image. A similar approach was used in Google Maps to construct the models of several popular landmarks from all around the world using user-contributed Picasa and Panoramio photos [12].

Another possible approach to reducing the number of necessary pairwise image feature matchings lies in reducing the number of images to be processed, because the input image set may be highly redundant. The approach presented in [21] clusters the input images using the GIST [32] descriptor, giving rise to "iconic images". These images and the pairwise geometric relationships between them define an "iconic scene graph" that captures all the important aspects of the original image set. In [8], the method has been re-implemented in order to be highly parallel and therefore suitable for GPU computing.
4 City Modeling from Google Street View

When constructing 3D models of large city areas, it is beneficial to use 360° field of view images, see Figure 2, as they increase robustness against occlusions. The main contribution of the presented method lies in demonstrating that one can achieve SfM from a single sparse omnidirectional sequence with only an approximate knowledge of calibration, as opposed to [5, 39], where the models are computed from dense sequences and with precisely calibrated cameras. The proposed SfM pipeline is an extension of work [40], which demonstrated the performance of the recovery of camera poses and trajectory on an image sequence acquired by a single fish-eye lens camera.

Figure 2: Camera trajectory computed by SfM. (a) Camera positions (red circles) exported into Google Earth [11]. (b) The 3D model representing 4,799 camera positions (red circles) and 123,035 3D points (color dots).

Loop closing is facilitated by visual indexing. SURF [2] descriptors of each image are quantized into visual words and term frequency–inverse document frequency (tf-idf) vectors [36, 19] are computed. The image similarity matrix M is constructed by computing the cosines of the angles between normalized tf-idf vectors, i.e. their scalar products, for all pairs of images. The 1st to 50th diagonals of M are zeroed in order to exclude very small loops, and for each image Ii in the sequence, a candidate Ij for the endpoint of the loop which starts at Ii is selected as the one having the highest similarity score in the i-th row of M. Next, the candidate image Ij is verified by solving camera resectioning [28]. If the inlier ratio is higher than 70%, camera resectioning is considered successful and the candidate image Ij is accepted as the endpoint of the loop. The validated candidates are used to give additional constraints to the final bundle adjustment [22].
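The following numpy sketch illustrates this candidate selection on a matrix of normalized tf-idf vectors; the function name and structure are ours, and the geometric verification by resectioning [28] is only indicated in a comment.

```python
import numpy as np

def loop_candidates(tfidf, min_gap=50):
    """tfidf: (n_images, n_words) matrix of L2-normalized tf-idf vectors.
    Returns a loop-endpoint candidate j for every image i, mimicking the
    selection described above."""
    M = tfidf @ tfidf.T                      # cosine similarities (scalar products)
    n = M.shape[0]
    # Zero the main diagonal and the 1st..min_gap-th diagonals on both
    # sides to exclude self-matches and very small loops.
    i_idx, j_idx = np.indices((n, n))
    M[np.abs(i_idx - j_idx) <= min_gap] = 0.0
    return M.argmax(axis=1)                  # best candidate per row

# Each candidate pair (i, j) would then be verified by camera
# resectioning and kept only if the inlier ratio exceeds 70%.
```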
5 Omnidirectional Sequence Stabilization

In order to make the wide baseline SfM pipeline capable of recovering camera poses and trajectories even from sequences that have large differences in the amount of camera translation between consecutive frames, a keyframe selection method based on DAA computation is proposed. Secondly, stabilized non-central cylindrical images are generated to facilitate pedestrian detection using detectors trained on perspective images, under the assumption of a known ground plane position.

Figure 3: The apical angle τ at the point X reconstructed from the correspondence (x, x′) depends on the length of the camera translation t relative to the distances of X from the camera centers C, C′.

Measuring the Amount of Camera Translation by DAA. Having m matches (xi, x′i) and the essential matrix E computed from them, we can reconstruct the 3D points Xi. Figure 3 shows a point X reconstructed from an image match (x, x′). For each point X, the apical angle τ, which measures the length of the camera translation from the perspective of the point X, is computed. If the cameras are related by a pure rotation, all angles τ are equal to zero. The larger the camera translation, the larger the angles τ. For a given E and m matches (xi, x′i), one can select the decomposition of E into R and t which reconstructs the largest number of 3D points in front of the cameras. The apical angle τi corresponding to the match (xi, x′i) is computed by solving the set of linear equations

α′i x′i = αi R xi − t    (1)

for the relative distances αi, α′i in the least-squares sense and by using the law of cosines

2 αi α′i cos(τi) = αi² + α′i² − ‖t‖².    (2)

For a small translation w.r.t. the distance to the scene points, the approximation α′i = αi can be used and the apical angle τi becomes a linear function of ‖t‖.
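A small numpy sketch of this computation follows; it solves (1) for each match by linear least squares and applies (2). It is our illustration of the equations above, not the thesis code, and the names are ours.

```python
import numpy as np

def apical_angles(x1, x2, R, t):
    """x1, x2: (m, 3) ray directions of the matches (xi, x'i) in the two
    cameras; R, t: relative pose from the decomposition of E.
    Returns the apical angles tau_i in radians."""
    taus = []
    for xi, xpi in zip(x1, x2):
        # Solve  a' x'i = a R xi - t  for (a, a') in the least-squares
        # sense, i.e. the 3x2 system [R xi, -x'i] @ [a, a']^T = t.
        A = np.column_stack((R @ xi, -xpi))
        (a, ap), *_ = np.linalg.lstsq(A, t, rcond=None)
        # Law of cosines: 2 a a' cos(tau) = a^2 + a'^2 - ||t||^2.
        c = (a**2 + ap**2 - t @ t) / (2 * a * ap)
        taus.append(np.arccos(np.clip(c, -1.0, 1.0)))
    return np.array(taus)

# The dominant apical angle (DAA) is then the most frequent value of
# this distribution, e.g. the peak of a histogram of the angles.
```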
Figure 4: Image stabilization and transformation. (a) The camera poses and the world 3D points reconstructed by our SfM visualized from a bird's eye view. (b) Original images. (c) Non-stabilized images. (d) Stabilized images.

Omnidirectional Image Stabilization. The recovered camera poses and trajectory can be used to rectify the original images into stabilized ones. Image stabilization is beneficial e.g. for facilitating visual object recognition, where (i) objects can be detected in canonical orientations and (ii) the ground plane position can further restrict feasible object locations. When the sequence is captured by walking or driving on roads, the images can be stabilized w.r.t. the ground plane with the natural assumption that the motion direction is parallel to the ground plane. If there exists no constraint on camera motion in the sequence, the simplest way of stabilization is to rectify the images w.r.t. the up vector in the coordinate system of the first camera, see Figure 4.

6 Randomized Structure from Motion

The computation of the randomized SfM pipeline for unordered image sets consists of four consecutive steps which are executed one after another: (i) computing the image similarity matrix, (ii) constructing atomic 3D models from camera triplets, (iii) merging partial reconstructions, and (iv) gluing single cameras to the best partial reconstruction, see Figure 5.

Figure 5: Overview of the pipeline. Input images are described by SURF and an image similarity matrix is computed. Atomic 3D models are constructed from camera triplets, merged together into partial reconstructions, and finally single cameras are glued to the largest partial reconstruction.

The image similarity matrix M is used to select triplets of cameras suitable for constructing atomic 3D models. The maximum score in the matrix gives a pair of cameras Ci and Cj. Then, three "third camera" candidates are found and atomic 3D models are constructed for each of the candidates. The resulting models are ranked by a quality score, which checks (i) whether there is a sufficient number of 3D points with large apical angles and (ii) the uniformity of image coverage by the projections of the reconstructed 3D points, and the model with the highest quality score is selected. Denoting the third camera corresponding to the selected atomic 3D model as Ck, cameras Ci, Cj, and Ck are removed from future selections by zeroing rows and columns i, j, and k of M.

When the atomic 3D model construction step finishes, each accepted atomic 3D model gives rise to a partial reconstruction, and merging guided by a similarity matrix containing scores between the selected atomic 3D models is performed. During the merging step, the partial reconstructions are connected together, forming larger partial reconstructions containing the union of the cameras and 3D points of the connected reconstructions. Finally, in the gluing step, the best partial reconstruction is selected as the one containing the highest number of cameras, and the poses of the cameras which are not contained in it are estimated [28]. The cone test is used to evaluate the support of individual RANSAC samples during merging and gluing.

7 Image Set Reduction and Prioritized SfM

Unstructured web collections often contain a large number of very similar images of landmarks while, on the other hand, image sequences often have a very limited overlap between images. To speed up the reconstruction, it is desirable to select a subset of the input images in such a way that all the remaining images have a significant visual overlap with at least one image from the selected ones, so that the connectivity of the resulting model is not damaged. To select such a subset, the approximate minimum connected dominating set can be computed by a fast polynomial algorithm [13] on the graph constructed according to the estimated visual overlap. This is closely related to the maximum leaf spanning tree algorithm employed in [38], but the construction of the graph is less computationally demanding in our case. Compared to [21], which uses GIST, our method is more robust to viewpoint changes.
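As an illustration of this reduction, here is a small pure-Python greedy approximation of a connected dominating set in the spirit of [13]; it is a simplified sketch of the general idea, not the exact algorithm or parameters used in the thesis.

```python
def greedy_cds(adj):
    """adj: dict mapping image id -> set of visually similar image ids
    (the estimated-overlap graph). Returns an approximate connected
    dominating set built by a simple greedy scan."""
    white = set(adj)                      # images not yet dominated
    black, gray = set(), set()            # selected images / dominated images
    # Start from the image dominating the most other images.
    start = max(adj, key=lambda v: len(adj[v]))
    black.add(start)
    white.discard(start)
    gray |= adj[start] & white
    white -= adj[start]
    while white:
        if not gray:
            break                         # remaining images are unreachable
        # Select the dominated image covering the most undominated ones;
        # picking only gray vertices keeps the selected set connected.
        best = max(gray, key=lambda v: len(adj[v] & white))
        if not adj[best] & white:
            break                         # the overlap graph is disconnected
        gray.discard(best)
        black.add(best)
        new = adj[best] & white
        gray |= new
        white -= new
    return black
```

By construction of the overlap graph, every image outside the returned set has a visually similar image inside it, so it can still be connected to the model later.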
The proposed SfM pipeline, reconstructing only the selected subset of images, still uses the atomic 3D models constructed from camera triplets as the basic elements of the reconstruction, but the strict division of the computation into steps is relaxed by introducing a priority queue which interleaves different reconstruction tasks, see Figure 6, in order to obtain a good scene-covering reconstruction in limited time. Our aim is to avoid spending too much time on a few difficult matching problems by exploring other, easier options, which leads to a comparable resulting 3D model in a shorter computational time than the previous method; see Figure 7 for the obtained 3D models.

Figure 6: Schematic visualization of the computation. The task retrieved from the head of the priority queue can be either an atomic 3D model construction task (dark gray) or an image connection task (light gray). An unsuccessful atomic 3D model construction (–) inserts another atomic 3D model construction task with the priority key doubled into the queue, a successful one (+) inserts five image connection tasks. An unsuccessful image connection (–) inserts the same task again with the priority key doubled, a successful one (+) inserts a new image connection task. Merging of overlapping 3D models is called implicitly after every successful image connection if the overlap is sufficient.

Figure 7: The largest partial 3D models reconstructed from the reduced image sets after 6 hours of computation. (a) Omnidirectional image set CASTLE (1,063 images). (b) Perspective image set VIENNA (1,008 images).
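The scheduling scheme of Figure 6 can be sketched with Python's heapq as follows; the Task class is a toy stand-in with simulated outcomes, not the actual reconstruction tasks, and all names are ours.

```python
import heapq, itertools, random

class Task:
    """Toy stand-in for the tasks of Figure 6; real tasks construct
    atomic 3D models or connect images to existing models."""
    def __init__(self, kind, key):
        self.kind, self.key = kind, key   # "atomic" or "connect"
    def run(self):
        return random.random() < 0.5      # simulated success/failure

tie = itertools.count()                   # tie-breaker for equal keys

def push(queue, key, task):
    heapq.heappush(queue, (key, next(tie), task))

def run_pipeline(seed_tasks, budget=100):
    queue = []
    for t in seed_tasks:
        push(queue, t.key, t)
    for _ in range(budget):               # limited computation time
        if not queue:
            break
        key, _, task = heapq.heappop(queue)
        if task.run():
            if task.kind == "atomic":     # success: spawn 5 connection tasks
                for _ in range(5):
                    push(queue, key, Task("connect", key))
            else:                         # success: spawn 1 connection task
                push(queue, key, Task("connect", key))
                # merging of overlapping models would be triggered here
        else:                             # failure: retry with key doubled
            push(queue, 2 * key, task)

run_pipeline([Task("atomic", 1.0) for _ in range(10)])
```

The doubling of the priority key on failure postpones difficult tasks exponentially, so the queue naturally drifts towards the easier options mentioned above.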
8 Conclusions

The thesis contributes to improving the scalability and efficiency of Structure from Motion computation from both ordered and unordered image sets. In particular, the usefulness of visual indexing for estimating the visual overlap between images was shown. When visual indexing is used, exhaustive pairwise image matching can be avoided, the sizes of redundant image sets can be significantly reduced, and the SfM computation can be properly prioritized. The reconstruction pipelines are accessible to registered users through our web-based interface, http://ptak.felk.cvut.cz/sfmservice, and were successfully used to reconstruct many challenging image sets.

References

[1] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. In ICCV'09, pages 72–79, 2009.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU, 110(3):346–359, June 2008.
[3] M. Brown and D. Lowe. Unsupervised 3D object recognition and reconstruction in unordered datasets. In 3-D Digital Imaging and Modeling (3DIM), pages 56–63, 2005.
[4] O. Chum and J. Matas. Matching with PROSAC: Progressive sample consensus. In CVPR'05, pages I:220–226, 2005.
[5] N. Cornelis, K. Cornelis, and L. Van Gool. Fast compact city modeling for navigation previsualization. In CVPR'06, pages II:1339–1344, 2006.
[6] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multiperson tracking. In CVPR'08, pages 1–8, 2008.
[7] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24(6):381–395, June 1981.
[8] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, Ch. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In ECCV'10, pages IV:368–381, 2010.
[9] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. In CVPR'07, 2007.
[10] T. Goedemé, M. Nuttin, T. Tuytelaars, and L. Van Gool. Omnidirectional vision based topological navigation. IJCV, 74(3):219–236, September 2007.
[11] Google. Google Earth – http://earth.google.com, 2004.
[12] Google. Photo tours in Google Maps – http://maps.google.com/phototours, 2012.
[13] S. Guha and S. Khuller. Approximation algorithms for connected dominating sets. Algorithmica, 20(4):374–387, 1998.
[14] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2003.
[15] M. Havlena, A. Torii, J. Knopp, and T. Pajdla. Randomized structure from motion based on atomic 3D models from camera triplets. In CVPR'09, pages 2874–2881, 2009.
[16] K.L. Ho and P. Newman. Detecting loop closure with scene sequences. IJCV, 74(3):261–286, September 2007.
[17] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In CVPR'11, pages 3121–3128, 2011.
[18] M. Klopschitz, C. Zach, A. Irschara, and D. Schmalstieg. Generalized detection and merging of loop closures for video sequences. In 3DPVT'08, pages 137–144, 2008.
[19] J. Knopp, J. Šivic, and T. Pajdla. Location recognition using large vocabularies and fast spatial matching. Research Report CTU–CMP–2009–01, CMP Prague, January 2009.
[20] H. Li and R. Hartley. A non-iterative method for correcting lens distortion from nine point correspondences. In OMNIVIS'05, pages 1–8, 2005.
[21] X.W. Li, C.C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV'08, pages I:427–440, 2008.
[22] M.I.A. Lourakis and A.A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Tech. Report 340, Institute of Computer Science – FORTH, August 2004.
[23] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, November 2004.
[24] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR'07, pages 1–8, 2007.
[25] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. IVC, 22(10):761–767, September 2004.
[26] B. Mičušík and J. Košecká. Piecewise planar city 3D modeling from street view panoramic sequences. In CVPR'09, pages 2906–2912, 2009.
[27] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, November 2005.
[28] D. Nistér. A minimal solution to the generalized 3-point pose problem. In CVPR'04, pages I:560–567, 2004.
[29] D. Nistér and C. Engels. Estimating global uncertainty in epipolar geometry for vehicle-mounted cameras. In SPIE – Unmanned Systems Technology VIII, pages 62301L:1–12, 2006.
[30] D. Nistér, F. Kahl, and H. Stewénius. Structure from motion with missing data is NP-hard. In ICCV'07, pages 1–7, 2007.
[31] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR'06, pages II:2161–2168, 2006.
[32] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, May 2001.
[33] Point Grey Research Inc. Ladybug 2 – http://www.ptgrey.com/products/ladybug2/index.asp, 2005.
[34] D. Scaramuzza, F. Fraundorfer, R. Siegwart, and M. Pollefeys. Closing the loop in appearance guided SfM for omnidirectional cameras. In OMNIVIS'08, pages 1–14, 2008.
[35] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or 'How Do I Organize My Holiday Snaps?'. In ECCV'02, pages I:414–431, 2002.
[36] J. Šivic and A. Zisserman. Video Google: Efficient visual search of videos. In Toward Category-Level Object Recognition (CLOR), pages 127–144, 2006.
[37] N. Snavely, S. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, 80(2):189–210, 2008.
[38] N. Snavely, S. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR'08, pages 1–8, 2008.
[39] J.P. Tardif, Y. Pavlidis, and K. Daniilidis. Monocular visual odometry in urban environments using an omnidirectional camera. In IROS'08, pages 2531–2538, 2008.
[40] A. Torii, M. Havlena, and T. Pajdla. Omnidirectional image stabilization by computing camera trajectory. In PSIVT'09, pages 71–82, 2009.
[41] A. Torii and T. Pajdla. Omnidirectional camera motion estimation. In VISAPP'08, pages II:577–584, 2008.
[42] M. Vergauwen and L. Van Gool. Web-based 3D reconstruction service. Machine Vision and Applications (MVA), 17(6):411–426, December 2006.
Resumé

The thesis deals with the computation of Structure from Motion from large ordered, i.e. sequential, and unordered image sets. We propose to replace the most time-consuming part of the computation from unordered data, the search for visual correspondences between all pairs of images, with a fast estimate of the visual overlap of two images based on the detected occurrences of visual words. Searching for correspondences only between pairs of images with a sufficient estimated overlap leads to a significant speedup of the computation. The efficiency of reconstruction from redundant image sets, e.g. from images of city landmarks downloaded from the Internet, can, however, be further improved by the proposed reduction of the input image set size using a fast graph algorithm.

Computational efficiency is also lost when the method spends too much time solving a few difficult reconstruction subtasks instead of looking for other, often easier, computation paths. We propose to use a priority queue for interleaving the individual subtasks, which makes it possible to obtain reasonable 3D models in limited time. The priorities of the individual tasks are set with respect to the estimated visual overlaps but are also influenced by the history of the computation.

Visual overlaps estimated from co-occurring visual words proved useful also for processing ordered image sets. Geometrically verified returns to previously visited parts of the scene, so-called visual loops, are added to the constructed 3D model as new constraints for the non-linear bundle adjustment method, which closes the detected loops by enforcing global consistency of the 3D model over the whole sequence.

A number of technical improvements of the computation were also proposed: (i) triplets of images are used instead of pairs as the seeds of reconstructions because 3D points verified in three views are rarely incorrect; (ii) the amount of camera translation w.r.t. the scene, an indicator of the reliability of relative camera pose computation, is measured by the dominant apical angle, and by finding image pairs with sufficiently large apical angles we select keyframes from input sequences or good reconstruction seeds from unordered image sets; (iii) a cone intersection test is used instead of the traditionally used reprojection error for verifying 2D-3D correspondences in RANSAC, so correct correspondences can be accepted even when the currently estimated 3D point positions are incorrect.

The functionality of the proposed methods is demonstrated by a number of experiments on ordered and unordered image sets consisting of thousands of images. Models of city areas are constructed both from fish-eye lens images and from omnidirectional panoramas. Images generated by the proposed non-central cylindrical projection from sequences stabilized w.r.t. the ground plane are successfully used for pedestrian detection.

Author's Publications

Publications related to the thesis

Impacted journal articles

[1] Akihiko Torii, Michal Havlena, and Tomáš Pajdla. Omnidirectional image stabilization for visual object recognition. International Journal of Computer Vision, 91(2):157–174, January 2011. Authorship: 50-25-25.

Publications excerpted by WOS

[2] Michal Havlena, Akihiko Torii, and Tomáš Pajdla. Efficient structure from motion by graph optimization. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, 11th European Conference on Computer Vision, Proceedings, Part II, volume 6312 of Lecture Notes in Computer Science, pages 100–113, Berlin, Germany, September 2010. Foundation for Research and Technology-Hellas (FORTH), Springer-Verlag. Authorship: 34-33-33.
[3] Michal Havlena, Akihiko Torii, Jan Knopp, and Tomáš Pajdla. Randomized structure from motion based on atomic 3D models from camera triplets. In CVPR 2009: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2874–2881, Madison, USA, June 2009. IEEE Computer Society, Omnipress. Authorship: 30-30-10-30.

[4] Akihiko Torii, Michal Havlena, and Tomáš Pajdla. Omnidirectional image stabilization by computing camera trajectory. In Toshikazu Wada, Fay Huang, and Stephen Y. Lin, editors, PSIVT '09: Advances in Image and Video Technology: Third Pacific Rim Symposium, volume 5414 of Lecture Notes in Computer Science, pages 71–82, Berlin, Germany, January 2009. Springer-Verlag. Authorship: 40-30-30.

[5] Akihiko Torii, Michal Havlena, Tomáš Pajdla, and Bastian Leibe. Measuring camera translation by the dominant apical angle. In CVPR 2008: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, page 7, Madison, USA, June 2008. IEEE Computer Society, Omnipress. Authorship: 30-30-30-10.

[6] Michal Havlena, Tomáš Pajdla, and Kurt Cornelis. Structure from omnidirectional stereo rig motion for city modeling. In AlpeshKumar N. Ranchordas and Helder J. Araújo, editors, VISAPP 2008: Proceedings of the Third International Conference on Computer Vision Theory and Applications, volume 2, pages 407–414, Setúbal, Portugal, January 2008. INSTICC Press. Authorship: 34-33-33.

Other conference/congress publications

[7] Michal Havlena, Michal Jančošek, Ben Huber, Frédéric Labrosse, Laurence Tyler, Tomáš Pajdla, Gerhard Paar, and Dave Barnes. Digital elevation modeling from Aerobot camera images. In EPSC-DPS 2011: European Planetary Science Congress Abstracts, volume 6, page 2, Goettingen, Germany, October 2011. Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship: 13-13-13-13-12-12-12-12.

[8] Michal Havlena, Michal Jančošek, Jan Heller, and Tomáš Pajdla. 3D surface models from Opportunity MER NavCam. In EPSC 2010: European Planetary Science Congress Abstracts, volume 5, page 2, Goettingen, Germany, September 2010. Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship: 25-25-25-25.

[9] Akihiko Torii, Michal Havlena, and Tomáš Pajdla. From Google street view to 3D city models. In OMNIVIS '09: 9th IEEE Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras, page 8, Los Alamitos, USA, October 2009. IEEE Computer Society Press. Authorship: 40-30-30.

[10] Michal Havlena, Akihiko Torii, Michal Jančošek, and Tomáš Pajdla. Automatic reconstruction of Mars artifacts. In EPSC 2009: European Planetary Science Congress Abstracts, volume 4, page 2, Goettingen, Germany, September 2009. Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship: 25-25-25-25.

[11] Michal Havlena, Akihiko Torii, and Tomáš Pajdla. Camera trajectory from wide baseline images. In EPSC 2008: European Planetary Science Congress Abstracts, volume 3, page 2, Goettingen, Germany, September 2008. Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship: 34-33-33.

[12] Michal Havlena. City modeling from omnidirectional video. In Bohuslav Říha, editor, Proceedings of Workshop 2008, pages 114–115, Prague, Czech Republic, February 2008. Czech Technical University in Prague. Authorship: 100.
[13] Michal Havlena, Kurt Cornelis, and Tomáš Pajdla. Towards city modeling from omnidirectional video. In Michael Grabner and Helmut Grabner, editors, CVWW 2007: Proceedings of the 12th Computer Vision Winter Workshop, pages 123–130, Graz, Austria, February 2007. Institute for Computer Graphics and Vision, Graz University of Technology, Verlag der Technischen Universität Graz. Authorship: 34-33-33.

Additional publications

Impacted journal articles

[14] Daphna Weinshall, Alon Zweig, Hynek Hermansky, Stefan Kombrink, Frank W. Ohl, Jörn Anemüller, Jörg-Hendrik Bach, Luc Van Gool, Fabian Nater, Tomáš Pajdla, Michal Havlena, and Misha Pavel. Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99), 2012, to appear. Authorship: 9-9-9-9-8-8-8-8-8-8-8-8.

Publications excerpted by WOS

[15] Jan Heller, Michal Havlena, and Tomáš Pajdla. A branch-and-bound algorithm for globally optimal hand-eye calibration. In CVPR 2012: Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, June 2012, to appear. Authorship: 34-33-33.

[16] Michal Havlena, Jan Heller, Hendrik Kayser, Jörg-Hendrik Bach, Jörn Anemüller, and Tomáš Pajdla. Incongruence detection in audio-visual processing. In Daphna Weinshall, Jörn Anemüller, and Luc Van Gool, editors, Detection and Identification of Rare Audiovisual Cues, volume 384 of Studies in Computational Intelligence, pages 67–75, Berlin, Germany, September 2012. Universitat Politècnica de Catalunya – BarcelonaTech (UPC), Springer-Verlag. Authorship: 17-17-17-17-16-16.

[17] Tomáš Pajdla, Michal Havlena, and Jan Heller. Learning from incongruence. In Daphna Weinshall, Jörn Anemüller, and Luc Van Gool, editors, Detection and Identification of Rare Audiovisual Cues, volume 384 of Studies in Computational Intelligence, pages 119–127, Berlin, Germany, September 2012. Universitat Politècnica de Catalunya – BarcelonaTech (UPC), Springer-Verlag. Authorship: 34-33-33.

[18] Jan Heller, Michal Havlena, Akihiro Sugimoto, and Tomáš Pajdla. Structure-from-motion based hand-eye calibration using L∞ minimization. In Pedro Felzenszwalb, David Forsyth, and Pascal Fua, editors, CVPR 2011: Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3497–3503, Los Alamitos, USA, June 2011. IEEE Computer Society. Authorship: 25-25-25-25.

[19] Michal Havlena, Andreas Ess, Wim Moreau, Akihiko Torii, Michal Jančošek, Tomáš Pajdla, and Luc Van Gool. AWEAR 2.0 system: Omni-directional audiovisual data acquisition and processing. In EGOVIS 2009: Proceedings of the First Workshop on Egocentric Vision, pages 49–56, Madison, USA, June 2009. Omnipress. Authorship: 15-15-14-14-14-14-14.

Other conference/congress publications

[20] Jörn Anemüller, Jörg-Hendrik Bach, Barbara Caputo, Michal Havlena, Luo Jie, Hendrik Kayser, Bastian Leibe, Petr Motlicek, Tomáš Pajdla, Misha Pavel, Akihiko Torii, Luc Van Gool, Alon Zweig, and Hynek Hermansky. The DIRAC AWEAR audio-visual platform for detection of unexpected and incongruent events. In Vassilios Digalakis, Alexandros Potamianos, Matthew Turk, Roberto Pieraccini, and Yuri Ivanov, editors, ICMI 2008: Proceedings of the 10th International Conference on Multimodal Interfaces, pages 289–292, New York, USA, October 2008. Association for Computing Machinery, ACM. Authorship: 8-8-7-7-7-7-7-7-7-7-7-7-7-7.

Citations of Author's Work
[1] Thomas Albrecht, Tele Tan, Geoff A. W. West, and Thanh Ly. Omnidirectional video stabilisation on a virtual camera using sensor fusion. In ICARCV 2010: Proceedings of the 11th International Conference on Control, Automation, Robotics and Vision, pages 2067–2072, 2010.

[2] Jean-Charles Bazin, Cedric Demonceaux, Pascal Vasseur, and Inso Kweon. Rotation estimation and vanishing point extraction by omnidirectional vision in urban environment. International Journal of Robotics Research, 31(1):63–81, January 2012.

[3] Jing Chen and Baozong Yuan. Metric 3D reconstruction from uncalibrated unordered images with hierarchical merging. In B.Z. Yuan, Q.Q. Ruan, and X.F. Tang, editors, ICSP 2010: Proceedings of the 2010 IEEE 10th International Conference on Signal Processing, Vols I-III, pages 1169–1172, 2010.

[4] Shengyong Chen, Yuehui Wang, and Carlo Cattani. Key issues in modeling of complex 3D structures from video sequences. Mathematical Problems in Engineering, 2012.

[5] Jerome Courchay, Arnak Dalalyan, Renaud Keriven, and Peter Sturm. Exploiting loops in the graph of trifocal tensors for calibrating a network of cameras. In K. Daniilidis, P. Maragos, and N. Paragios, editors, Computer Vision – ECCV 2010, 11th European Conference on Computer Vision, Proceedings, Part II, volume 6312 of Lecture Notes in Computer Science, pages 85–99, 2010.

[6] Jerome Courchay, Arnak S. Dalalyan, Renaud Keriven, and Peter Sturm. On camera calibration with linear programming and loop constraint linearization. International Journal of Computer Vision, 97(1):71–90, March 2012.

[7] Jean-Lou De Carufel and Robert Laganiere. Matching cylindrical panorama sequences using planar reprojections. In OMNIVIS '11: 11th IEEE Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras, 2011.

[8] Weijia Feng, Juha Roning, Juho Kannala, Xiaoning Zong, and Baofeng Zhang. A general model and calibration method for spherical stereoscopic vision. In J. Roning and D.P. Casasent, editors, Intelligent Robots and Computer Vision XXIX: Algorithms and Techniques, volume 8301 of Proceedings of SPIE, 2012.

[9] Danilo Hollosi, Stefan Wabnik, Stephan Gerlach, and Steffen Kortlang. Catalog of basic scenes for rare/incongruent event detection. In D. Weinshall, J. Anemuller, and L. Van Gool, editors, Detection and Identification of Rare Audiovisual Cues, volume 384 of Studies in Computational Intelligence, pages 77–84. Springer-Verlag, 2012.

[10] Maxime Lhuillier. A generic error model and its application to automatic 3D modeling of scenes using a catadioptric camera. International Journal of Computer Vision, 91(2):175–199, January 2011.

[11] Christopher Rasmussen, Yan Lu, and Mehmet Kocamaz. Trail following with omnidirectional vision. In IROS 2010: Proceedings of the IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, 2010.

[12] Richard Roberts, Sudipta N. Sinha, Richard Szeliski, and Drew Steedly. Structure from motion for scenes with large duplicate structures. In CVPR 2011: Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.

[13] Wolfgang Stuerzl, Darius Burschka, and Michael Suppa. Monocular ego-motion estimation with a compact omnidirectional camera. In IROS 2010: Proceedings of the IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, pages 822–828, 2010.
[14] Yuan-Kai Wang, Ching-Tang Fan, Shao-Ang Chen, and Hou-Ye Chen. X-Eye: A novel wearable vision system. In N. Kehtarnavaz and M.F. Carlsohn, editors, Real-time Image and Video Processing 2011, volume 7871 of Proceedings of SPIE, 2011.

[15] Marc Wieland, Massimiliano Pittore, Stefano Parolai, Jochen Zschau, Bolot Moldobekov, and Ulugbek T. Begaliev. Estimating building inventory for rapid seismic vulnerability assessment: Towards an integrated approach based on multisource imaging. Soil Dynamics and Earthquake Engineering, 36:70–83, May 2012.

[16] Christopher Zach, Manfred Klopschitz, and Marc Pollefeys. Disambiguating visual relations using loop constraints. In CVPR 2010: Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1426–1433, 2010.