CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering
Department of Cybernetics

DOCTORAL THESIS STATEMENT

Michal Havlena

Incremental Structure from Motion for Large Ordered and Unordered Sets of Images

Ph.D. Programme: Electrical Engineering and Information Technology, P2612
Branch of study: Artificial Intelligence and Biocybernetics, 3902V035

Doctoral thesis statement for obtaining the academic title of “Doctor”, abbreviated to “Ph.D.”

Prague, June 2012
The doctoral thesis was produced during full-time Ph.D. study at the Center for Machine Perception of the Department of Cybernetics of the Faculty of Electrical Engineering of the CTU in Prague.
Candidate:
Mgr. Michal Havlena
Department of Cybernetics
Faculty of Electrical Engineering of the CTU in Prague
Supervisor:
Ing. Tomáš Pajdla, Ph.D.
Department of Cybernetics
Faculty of Electrical Engineering of the CTU in Prague
Opponents:
............................................................................................
............................................................................................
............................................................................................
The doctoral thesis statement was distributed on ................... .
The defence of the doctoral thesis will be held on .................... at ....... a.m./p.m.
before the Board for the Defence of the Doctoral Thesis in the branch of study
Artificial Intelligence and Biocybernetics in the meeting room No. ......... of the
Faculty of Electrical Engineering of the CTU in Prague.
Those interested may get acquainted with the doctoral thesis concerned at the
Dean Office of the Faculty of Electrical Engineering of the CTU in Prague, at
the Department for Science and Research, Technická 2, 166 27 Prague 6.
Prof. Ing. Vladimír Mařík, DrSc.
Chairman of the Board for the Defence of the Doctoral Thesis
in the branch of study Artificial Intelligence and Biocybernetics
Department of Cybernetics, Karlovo náměstí 13, 121 35 Prague 2
Contents
1 Problem Formulation
2 Contributions
3 State of the Art
4 City Modeling from Google Street View
5 Omnidirectional Sequence Stabilization
6 Randomized Structure from Motion
7 Image Set Reduction and Prioritized SfM
8 Conclusions
Keywords:
Structure from Motion, Omnidirectional Vision, City Modeling
Note: Full text of the thesis is available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/havlena/Havlena-TR-2012-13.pdf
1 Problem Formulation
In this thesis, we deal with the problem of modeling large predominantly static
scenes from ordered, i.e. sequential, and unordered image sets comprising thousands of images, when no further a priori information about the captured scene is available. To solve such a task, we build upon the foundations of
multiple view geometry [14], namely Structure from Motion.
The goal of Structure from Motion (SfM) is to recover both the positions of
the 3D points of the scene (structure) and the unknown poses of the cameras
capturing the scene (motion, external camera calibration). Knowing camera
poses, one can use multi-view stereo 3D surface reconstruction methods [9, 17]
to construct compelling 3D surface models of the scene, see Figure 1.
There is no closed-form solution to the SfM problem [30]. The
problem is highly non-linear and the search for the optimal model parameters
by minimizing the reprojection error is likely to get stuck in a local minimum.
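For concreteness, the quantity being minimized is the standard reprojection error over all cameras and scene points (a textbook formulation [14], restated here only for reference; the visibility indicator vᵢⱼ is our notation):

    min over {Pᵢ}, {Xⱼ} of  Σᵢ,ⱼ vᵢⱼ ‖xᵢⱼ − π(Pᵢ, Xⱼ)‖²,

where π(Pᵢ, Xⱼ) denotes the projection of the 3D point Xⱼ by camera Pᵢ, xᵢⱼ is the measured image position of that point, and vᵢⱼ = 1 if point j is visible in image i and 0 otherwise.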
The methods described in this thesis fall into the group of the “incremental”
methods which use simple components such as epipolar geometry computation
for a pair of images, 3D point triangulation, and camera resectioning to build the
resulting model incrementally, starting from one or more seed reconstructions
and connecting additional cameras until the whole scene is reconstructed [37],
all interleaved with necessary non-linear optimization [22]. The art of such
methods resides in the design of a computational “pipeline” which combines
the aforementioned components in order to create a reconstruction procedure
which is both robust and efficient for the input image data. When the size of
the image set grows, exhaustive pairwise image matching commonly used for
revealing the structure of the data becomes infeasible.
Figure 1: (a) Sample images from the 316 images of Marseille Centre de la Vieille Charité. (b) External camera calibration and a sparse 3D point cloud obtained using method [38]. (c) 3D surface reconstruction computed by method [17].
2 Contributions
The contribution of the thesis is related to large scale Structure from Motion
(SfM) from both ordered and unordered image sets. The research on sequential
SfM was conducted in collaboration with my colleague, Akihiko Torii, and
the research on SfM from unordered data was done by me. Specifically, my
contributions are the following:
• Use of visual indexing for image pair/triplet selection. We avoid the most time-consuming step of large scale SfM from unordered image sets, the computation of all pairwise matches and geometries, by sampling pairs of images and estimating visual overlap using the detected occurrences of visual words. The evaluation of the similarity scores by computing scalar products of so-called tf-idf vectors [36] is also quadratic in the number of images in the set, but the scalar product is a much simpler operation than full feature matching, which leads to a significant speedup of SfM. Furthermore, we proposed to sample triplets of images instead of pairs for the seeds of the reconstruction because 3D points verified in three views are more likely to be correct. The constructed atomic 3D models are merged together to give the final large scale 3D model at later stages of the computation.
• Image set reduction by applying a graph algorithm (CDS). The idea of using visual indexing for SfM was further extended in order to be able to efficiently reconstruct image sets with uneven image coverage, i.e. community image sets of cities with landmarks. A small subset of the input images is selected by computing, with a fast polynomial algorithm [13], the approximate minimum connected dominating set of a graph whose vertices are the images and whose edges connect visually similar images. This kind of reduction guarantees, to some extent, that the removed images have visual overlap with at least some images left in the set and can therefore be connected to the resulting 3D model later.
• Task ordering using a priority queue. We use task prioritization to avoid spending too much time on a few difficult matching problems instead of exploring other easier options. Compared to our previous work having the computation split into several stages [15], the usage of a priority queue for interleaving different “atomic 3D model construction” and “image connection” tasks facilitates obtaining reasonable reconstructions in limited time. The priorities of the individual tasks are set according to image similarity and the history of the computation.
Joint contributions follow:
• Computation of the dominant apical angle (DAA). When performing sequential SfM by chaining pairwise epipolar geometries [14], the reconstruction usually fails when the amount of translation between consecutive cameras is not sufficient. We demonstrate that the amount of translation can be reliably measured, for general as well as planar scenes, by the most frequent apical angle, the angle under which the camera centers are seen from the perspective of the reconstructed scene points. By selecting only image pairs which have a sufficient DAA, one is able to easily reconstruct even sequences with variable camera motion speed.
• Sequence bridging by visual indexing. We extend the known concept
of loop closing, e.g. [16], which tries to correct the trajectory of the camera once the same place is re-visited, by searching for all the trajectory
loops at once based on co-occurring visual words. Geometrically verified loop candidates are added to the model as new constraints for bundle adjustment, which closes the detected loops by enforcing global consistency of camera poses and 3D structure in the sequence.
• Image stabilization using non-central cylindrical image generation.
A new technique for omnidirectional image rectification based on stereographic image projection was introduced as an alternative to central cylindrical image generation. We show that non-central cylindrical images
are suitable for people recognition with classifiers trained on perspective
data, e.g. [6], once the images are stabilized w.r.t. the ground plane.
• Using a cone test instead of the reprojection error. When verifying 2D-3D matches in a RANSAC loop [7], we do not rely on the widely used reprojection error but make use of the fact that the projections of the 3D point to all the related images are known and use a “cone test” instead. Two-pixel-wide pyramids are cast through the corresponding pixel locations in the related cameras and an LP feasibility task is solved to decide whether the intersection of the “cones” is non-empty or not, see the sketch below. This allows for accepting a correct match even if the currently estimated 3D point location is incorrect, without modeling the probability distribution of the location of the 3D point explicitly.
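A minimal sketch of such a feasibility test follows (an illustration, not the thesis implementation; the representation of each pyramid by four outward face normals and the use of scipy.optimize.linprog are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def cone_test(cones):
    """Decide whether the viewing pyramids ("cones") share a 3D point.

    cones -- one (C, N) pair per related camera, where C is the 3-vector
             camera center and N is a 4x3 array of outward face normals of
             the two-pixel-wide pyramid cast through the measured pixel.
    """
    # Each pyramid face contributes a half-space constraint n.(X - C) <= 0.
    A = np.vstack([N for _, N in cones])
    b = np.hstack([N @ C for C, N in cones])
    # LP feasibility: does any X satisfy A X <= b, i.e. lie in all pyramids?
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0   # feasible => non-empty intersection => inlier
```

Only the feasibility of the constraint set matters, hence the zero objective vector.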
It is worth mentioning that the methods were implemented to work with the
general central camera model, covering the most common special cases including (i) perspective cameras, (ii) fish-eye lenses, (iii) equirectangular panoramas,
and (iv) cameras calibrated by the polynomial omnidirectional model.
3 State of the Art
The complexity of large scale SfM computation is quite different for ordered
and unordered image sets, as image order gives a clue as to which pairs of images
should have an overlapping field of view and are therefore suitable for processing. On the other hand, the methods for fast selection of promising image
pairs can be used also for ordered image sets to improve the consistency of the
resulting models via loop closing.
Ordered Image Set Processing
Short baseline SfM using simple image features [5], which performs real-time
detection and matching, recovers camera poses and trajectory sufficiently well
when all camera motions between consecutive frames in the sequence are small.
On the other hand, wide baseline SfM based methods, which use richer features such as MSER [25], Laplacian-Affine, Hessian-Affine [27], SIFT [23],
and SURF [2], are capable of producing feasible tentative matches under large
changes of visual appearance between images induced by rapid changes of camera pose and illumination. Work [10] presented SfM based on wide baseline
matching of SIFT features using a single omnidirectional camera and demonstrated its performance in indoor environments.
The state of the art technique for finding relative camera poses from image
matches first establishes tentative matches by pairing image points with mutually similar features and then uses RANSAC [7, 14, 4] to look for a large
subset of the set of tentative matches which is, within a predefined threshold
ε, consistent with an epipolar geometry (EG) [14]. Unfortunately, this strategy
does not always recover the epipolar geometry generated by the actual camera
motion, which has been observed in [20, 29, 41]. Often, there are several models supported by a large number of matches. Then, the chance that the correct model, even if it has the largest support, will be found by running a single RANSAC is small. Work [20] suggested generating models by randomized sampling as in RANSAC but using soft (kernel) voting for a physical parameter instead of looking for the maximal support.
Work [26] demonstrated 3D modeling from perspective images exported
from Google Street View using piecewise planar structure constraints.
Another recent related work [39] demonstrated the performance of SfM which
employs guided matching by using epipolar geometries computed in previous
frames, and robust camera trajectory estimation by computing camera orientations and positions individually for the calibrated perspective images acquired
by the Point Grey Ladybug Spherical Digital Video Camera System [33].
In [18], loop closing capable of removing the drift error of sequential SfM
is achieved by merging partial reconstructions of overlapping sequences which
are extracted using an image similarity matrix [36, 19]. Work [34] finds loop
endpoints by using the image similarity matrix and verifies the loops by computing the rotation transform between the pairs of origins and endpoints under
the assumption that the positions of the origin and the endpoint of each loop
coincide. Furthermore, they constrain the camera motions to a plane to reduce
the number of parameters in bundle adjustment.
Unordered Image Set Processing
Most of the state of the art techniques for 3D reconstruction from unordered image sets [35, 3, 42, 24] start the computation by performing exhaustive pairwise
image matching in order to reveal the structure and connectivity of the data.
Bundler [37] uses exhaustive pairwise image feature matching and epipolar geometry computation to create an image graph which is later used to guide
the reconstruction. By finding the skeletal set [38] of this graph, the reconstruction time improves significantly but the time spent on image matching
remains the same. A recent advancement of the aforementioned technique [1]
abandons exhaustive pairwise image matching by using shared occurrences of
visual words [31, 36] to match only the ten most promising images for each input image. A similar approach was used in Google Maps to construct the models of several popular landmarks from all around the world using user-contributed Picasa and Panoramio photos [12].
Another possible approach to reducing the number of necessary pairwise image feature matchings lies in reducing the number of images to be processed
because the input image set may be highly redundant. The approach presented
in [21] clusters the input images using the GIST [32] descriptor, giving rise
to “iconic images”. These images and the pairwise geometric relationships
between them define an “iconic scene graph” that captures all the important aspects of the original image set. In [8], the method has been re-implemented in
order to be highly parallel and therefore suitable for GPU computing.
4 City Modeling from Google Street View
When constructing 3D models of large city areas, it is beneficial to use 360°
field of view images, see Figure 2, as it increases robustness against occlusions.
The main contribution of the presented method lies in demonstrating that one can achieve SfM from a single sparse omnidirectional sequence with only an approximate knowledge of calibration, as opposed to [5, 39] where the models are computed from dense sequences with precisely calibrated cameras.

Figure 2: Camera trajectory computed by SfM. (a) Camera positions (red circles) exported into Google Earth [11]. (b) The 3D model representing 4,799 camera positions (red circles) and 123,035 3D points (color dots).
The proposed SfM pipeline is an extension of work [40] which demonstrated the recovery of camera poses and trajectory on an image sequence acquired by a single fish-eye lens camera. Loop closing is facilitated by visual indexing. SURF [2] descriptors of each image are quantized into visual words and term frequency–inverse document frequency (tf-idf) vectors [36, 19] are computed. The image similarity matrix M is constructed by computing the cosines of the angles between normalized tf-idf vectors, i.e. their scalar products, for all pairs of images.
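In outline, the construction of M could look as follows (a schematic sketch assuming the tf-idf vectors are stored as rows of a SciPy sparse matrix; all names are illustrative):

```python
import numpy as np

def similarity_matrix(tfidf):
    """Cosine similarity, i.e. scalar products of normalized tf-idf vectors.

    tfidf -- sparse (n_images x vocabulary_size) matrix of tf-idf weights.
    """
    norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1)))
    normalized = tfidf.multiply(1.0 / norms).tocsr()       # L2-normalize the rows
    M = np.asarray((normalized @ normalized.T).todense())  # all pairwise cosines
    np.fill_diagonal(M, 0.0)    # an image is never paired with itself
    return M
```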
The 1st to 50th diagonals of M are zeroed in order to exclude very small loops and, for each image Iᵢ in the sequence, a candidate Iⱼ for the endpoint of the loop which starts from Iᵢ is selected as the one having the highest similarity score in the i-th row of M. Next, the candidate image Iⱼ is verified by solving camera resectioning [28]. If the inlier ratio is higher than 70%, camera resectioning is considered successful and the candidate image Iⱼ is accepted as the endpoint of the loop. The validated candidates are used to give additional constraints on the final bundle adjustment [22].
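The selection and verification step could be sketched as follows (schematic; resection() stands in for the camera resectioning of [28] and is assumed to return the inlier ratio for a candidate pair):

```python
import numpy as np

def find_loop_candidates(M, resection, min_gap=50, min_inlier_ratio=0.7):
    """Select and verify loop endpoint candidates from similarity matrix M."""
    M = M.copy()
    n = M.shape[0]
    # Zero the 1st to 50th diagonals to exclude very small loops.
    for i in range(n):
        M[i, max(0, i - min_gap):min(n, i + min_gap + 1)] = 0.0
    loops = []
    for i in range(n):
        j = int(np.argmax(M[i]))          # most similar image to I_i
        if M[i, j] > 0 and resection(i, j) >= min_inlier_ratio:
            loops.append((i, j))          # accepted loop endpoint
    return loops
```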
5 Omnidirectional Sequence Stabilization
In order to make the wide baseline SfM pipeline capable of recovering camera poses and trajectories even from sequences that have large differences in the amount of camera translation between consecutive frames, a keyframe selection method based on DAA computation is proposed. Secondly, stabilized non-central cylindrical images are generated to facilitate pedestrian detection with detectors trained on perspective images, under the known ground plane position assumption.

Figure 3: The apical angle τ at the point X reconstructed from the correspondence (x, x′) depends on the length of the camera translation t relative to the distances of X from the camera centers C and C′.
Measuring the Amount of Camera Translation by DAA. Having m matches (xᵢ, x′ᵢ) and the essential matrix E computed from them, we can reconstruct 3D points Xᵢ. Figure 3 shows a point X reconstructed from an image match (x, x′). For each point X, the apical angle τ, which measures the length of the camera translation from the perspective of the point X, is computed. If the cameras are related by pure rotation, all angles τ are equal to zero. The larger the camera translation, the larger the angles τ.

For a given E and m matches (xᵢ, x′ᵢ), one can select the decomposition of E into R and t which reconstructs the largest number of 3D points in front of the cameras. The apical angle τᵢ, corresponding to the match (xᵢ, x′ᵢ), is computed by solving the set of linear equations for the relative distances αᵢ, α′ᵢ

    αᵢ xᵢ = α′ᵢ R x′ᵢ − t    (1)

in the least squares sense and by using the law of cosines

    2 αᵢ α′ᵢ cos(τᵢ) = αᵢ² + α′ᵢ² − ‖t‖².    (2)

For a small translation w.r.t. the distance to the scene points, the approximation αᵢ = α′ᵢ can be used and the apical angle τᵢ becomes a linear function of ‖t‖.
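A direct transcription of Eqs. (1) and (2) might look as follows (a sketch; the array conventions and the histogram used to extract the most frequent angle are assumptions):

```python
import numpy as np

def apical_angles(R, t, x1, x2):
    """Apical angles tau_i from Eqs. (1)-(2).

    R, t   -- relative rotation (3x3) and translation (3,) from decomposing E.
    x1, x2 -- (m x 3) arrays of matched unit ray directions.
    """
    taus = []
    for xi, xip in zip(x1, x2):
        # Least-squares solution of  alpha*x = alpha'*R*x' - t  for (alpha, alpha').
        A = np.column_stack((xi, -R @ xip))
        alpha, alpha_p = np.linalg.lstsq(A, -t, rcond=None)[0]
        # Law of cosines (2) gives the apical angle at the 3D point.
        cos_tau = (alpha**2 + alpha_p**2 - t @ t) / (2 * alpha * alpha_p)
        taus.append(np.arccos(np.clip(cos_tau, -1.0, 1.0)))
    return np.array(taus)

def dominant_apical_angle(taus, bins=90):
    """Mode of the apical angle histogram, i.e. the most frequent angle."""
    hist, edges = np.histogram(taus, bins=bins, range=(0.0, np.pi / 2))
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```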
Figure 4: Image stabilization and transformation. (a) The camera poses and the world 3D points reconstructed by our SfM visualized from a bird’s eye view. (b) Original images. (c) Non-stabilized images. (d) Stabilized images.
Omnidirectional Image Stabilization. The recovered camera poses and
trajectory can be used to rectify the original images to the stabilized ones. Image stabilization is beneficial e.g. for facilitating visual object recognition where
(i) objects can be detected in canonical orientations and (ii) ground plane position can further restrict feasible object locations. When the sequence is captured by walking or driving on the roads, the images can be stabilized w.r.t.
the ground plane under the natural assumption that the motion direction is parallel to the ground plane. If there is no constraint on the camera motion in the sequence, the simplest way of stabilization is to rectify the images w.r.t. the up vector
in the coordinate system of the first camera, see Figure 4.
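For the last, unconstrained case, one simple realization is to build the rotation aligning the estimated up vector with the vertical axis and to compose it with each camera rotation before resampling the images (an illustrative sketch, not the exact procedure of the thesis):

```python
import numpy as np

def stabilizing_rotation(up):
    """Rotation whose rows form an orthonormal frame with `up` as its z-axis.

    Composing its transpose with a camera rotation aligns the camera with
    the estimated vertical, so the resampled (e.g. cylindrical) image is
    stabilized w.r.t. the ground plane.
    """
    z = np.asarray(up, dtype=float)
    z /= np.linalg.norm(z)
    # Any vector not parallel to z serves to complete the basis.
    a = np.array([1.0, 0.0, 0.0]) if abs(z[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x = np.cross(a, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.vstack((x, y, z))
```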
6 Randomized Structure from Motion
The computation of the randomized SfM pipeline for unordered image sets consists of four steps executed one after another: (i) computing the image similarity matrix, (ii) constructing atomic 3D models from camera triplets, (iii) merging partial reconstructions, and (iv) gluing single cameras to the best partial reconstruction, see Figure 5.

Figure 5: Overview of the pipeline. Input images are described by SURF and an image similarity matrix is computed. Atomic 3D models are constructed from camera triplets, merged together into partial reconstructions, and finally single cameras are glued to the largest partial reconstruction.
The image similarity matrix M is used to select triplets of cameras suitable for constructing atomic 3D models. The maximum score in the matrix gives a pair of cameras Cᵢ and Cⱼ. Then, three “third camera” candidates are found and atomic 3D models are constructed for each of the candidates. The resulting models are ranked by the quality score, which checks (i) whether there is a sufficient number of 3D points with large apical angles and (ii) the uniformity of image coverage by the projections of the reconstructed 3D points, and the model with the highest quality score is selected. Denoting the third camera corresponding to the selected atomic 3D model as Cₖ, cameras Cᵢ, Cⱼ, and Cₖ are removed from future selections by zeroing the rows and columns i, j, and k of M.
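Schematically, the selection loop could be written as follows (a sketch; choosing the three “third camera” candidates by their similarity to both images of the best pair is one plausible realization, and build_atomic() stands in for the atomic reconstruction returning its quality score):

```python
import numpy as np

def select_seed_triplets(M, build_atomic, n_candidates=3):
    """Greedily select camera triplets for atomic 3D model construction.

    M            -- symmetric image similarity matrix with a zero diagonal.
    build_atomic -- assumed callback attempting the reconstruction of a
                    triplet and returning its quality score, or None on failure.
    """
    M = M.copy()
    triplets = []
    while M.max() > 0:
        i, j = np.unravel_index(np.argmax(M), M.shape)    # best image pair
        # "Third camera" candidates: images similar to both I_i and I_j.
        order = np.argsort(np.minimum(M[i], M[j]))[::-1][:n_candidates]
        scored = [(build_atomic(i, j, int(k)), int(k))
                  for k in order if M[i, k] > 0 and M[j, k] > 0]
        scored = [(s, k) for s, k in scored if s is not None]
        if scored:
            score, k = max(scored)                        # highest quality model
            triplets.append((int(i), int(j), k))
            M[[i, j, k], :] = 0                           # remove the cameras
            M[:, [i, j, k]] = 0                           # from future selections
        else:
            M[i, j] = M[j, i] = 0                         # failed pair, move on
    return triplets
```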
When the atomic 3D model construction step finishes, each accepted atomic
3D model gives rise to a partial reconstruction, and merging guided by a similarity matrix containing scores between the selected atomic 3D models is performed. During the merging step, the partial reconstructions are connected
together forming larger partial reconstructions containing the union of cameras
and 3D points of the connected reconstructions. Finally, in the gluing step, the
best partial reconstruction is selected as the one containing the highest number
of cameras, and the poses of the cameras which are not contained in it are estimated [28]. The cone test is used to evaluate the support of individual RANSAC
samples during merging and gluing.
7 Image Set Reduction and Prioritized SfM
Unstructured web collections often contain a large number of very similar images of landmarks while, on the other hand, image sequences often have a very
limited overlap between images. To speed up the reconstruction, it is desirable
Figure 6: Schematic visualization of the computation. The task retrieved from the head
of the priority queue can be either an atomic 3D model construction task (dark gray) or
an image connection task (light gray). Unsuccessful atomic 3D model construction (–)
inserts another atomic 3D model construction task with the priority key doubled into
the queue, a successful one (+) inserts five image connection tasks. Unsuccessful image
connection (–) inserts the same task again with the priority key doubled, a successful one
(+) inserts a new image connection task. Merging of overlapping 3D models is called
implicitly after every successful image connection if the overlap is sufficient.
to select a subset of the input images in such a way that all the remaining images have a significant visual overlap with at least one image from the selected
ones, so the connectivity of the resulting model should not be damaged. For
selecting such a subset of input images, the approximate minimum connected
dominating set can be computed by a fast polynomial algorithm [13] on the
graph constructed according to the estimated visual overlap. This is closely
related to the maximum leaf spanning tree algorithm employed in [38] but the
construction of the graph is less computationally demanding in our case. Compared to [21], which uses GIST, our method is more robust to viewpoint changes.
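For illustration, the classical greedy approximation in the spirit of [13] can be sketched as follows (assuming the pairwise-similarity graph is connected; networkx is used only for brevity):

```python
import networkx as nx

def approx_connected_dominating_set(G):
    """Greedy approximation of the minimum connected dominating set.

    White = not yet dominated, gray = dominated, black = selected.
    Growing only from gray vertices keeps the selected subgraph connected.
    """
    white, gray, black = set(G.nodes), set(), set()

    def select(u):
        black.add(u)
        gray.discard(u)
        white.discard(u)
        for w in G[u]:
            if w in white:
                white.discard(w)
                gray.add(w)

    select(max(G.nodes, key=G.degree))   # start at the best-covering image
    while white:
        # Select the dominated vertex covering the most undominated images.
        select(max(gray, key=lambda g: sum(w in white for w in G[g])))
    return black
```

The images in the returned set are kept for reconstruction; every removed image is adjacent to a selected one, i.e. visually overlaps with it, and can thus be connected to the resulting 3D model later.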
The proposed SfM pipeline, reconstructing only the selected subset of images, still uses the atomic 3D models constructed from camera triplets as the
basic elements of the reconstruction but the strict division of the computation
into steps is relaxed by introducing a priority queue which interleaves different
reconstruction tasks, see Figure 6, in order to get a good scene-covering reconstruction in limited time. Our aim here is to avoid spending too much time on a few difficult matching problems by exploring other easier options which lead to a comparable resulting 3D model in a shorter computational time than the previous method, see Figure 7 for the obtained 3D models.

Figure 7: The largest partial 3D models reconstructed from the reduced image sets after 6 hours of computation. (a) Omnidirectional image set CASTLE (1,063 images). (b) Perspective image set VIENNA (1,008 images).
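The scheduling logic of Figure 6 can be condensed into the following skeleton (schematic, with illustrative names, not the thesis code; a task is a (key, id, run) triple whose run() reports success and any follow-up tasks):

```python
import heapq
import time

def prioritized_sfm(initial_tasks, time_limit=6 * 3600):
    """Interleave reconstruction tasks via a priority queue (cf. Figure 6)."""
    queue = list(initial_tasks)      # (priority_key, task_id, run) triples
    heapq.heapify(queue)
    deadline = time.time() + time_limit   # e.g. the 6-hour budget of Figure 7
    while queue and time.time() < deadline:
        key, task_id, run = heapq.heappop(queue)
        success, follow_ups = run()
        if success:
            # E.g. a successful atomic 3D model construction inserts
            # five image connection tasks.
            for t in follow_ups:
                heapq.heappush(queue, t)
        else:
            # Doubling the key postpones tasks that keep failing.
            heapq.heappush(queue, (2 * key, task_id, run))
```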
8 Conclusions
The thesis contributes to improving scalability and efficiency of Structure from
Motion computation from both ordered and unordered image sets. In particular,
the usefulness of visual indexing for estimating visual overlap between images
was shown. When visual indexing is used, exhaustive pairwise image matching
can be avoided, the sizes of redundant image sets can be significantly reduced,
and the SfM computation can be properly prioritized.
The reconstruction pipelines are accessible to registered users through
our web-based interface, http://ptak.felk.cvut.cz/sfmservice, and
were successfully used to reconstruct many challenging image sets.
References
[1] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. In
ICCV’09, pages 72–79, 2009.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU,
110(3):346–359, June 2008.
[3] M. Brown and D. Lowe. Unsupervised 3D object recognition and reconstruction in unordered
datasets. In 3-D Digital Imaging and Modeling (3DIM), pages 56–63, 2005.
[4] O. Chum and J. Matas. Matching with PROSAC: Progressive sample consensus. In CVPR’05,
pages I:220–226, 2005.
[5] N. Cornelis, K. Cornelis, and L. Van Gool. Fast compact city modeling for navigation pre-visualization. In CVPR’06, pages II:1339–1344, 2006.
[6] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In CVPR’08, pages 1–8, 2008.
[7] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with
applications to image analysis and automated cartography. Comm. ACM, 24(6):381–395, June
1981.
[8] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, Ch. Wu, Y.-H. Jen,
E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In
ECCV’10, pages IV:368–381, 2010.
[9] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. In CVPR’07,
2007.
[10] T. Goedemé, M. Nuttin, T. Tuytelaars, and L. Van Gool. Omnidirectional vision based topological navigation. IJCV, 74(3):219–236, September 2007.
[11] Google. Google Earth – http://earth.google.com, 2004.
[12] Google. Photo tours in Google Maps – http://maps.google.com/phototours,
2012.
[13] S. Guha and S. Khuller. Approximation algorithms for connected dominating sets. Algorithmica, 20(4):374–387, 1998.
[14] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, second edition, 2003.
[15] M. Havlena, A. Torii, J. Knopp, and T. Pajdla. Randomized structure from motion based on
atomic 3D models from camera triplets. In CVPR’09, pages 2874–2881, 2009.
[16] K.L. Ho and P. Newman. Detecting loop closure with scene sequences. IJCV, 74(3):261–286,
September 2007.
[17] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces.
In CVPR’11, pages 3121–3128, 2011.
[18] M. Klopschitz, C. Zach, A. Irschara, and D. Schmalstieg. Generalized detection and merging
of loop closures for video sequences. In 3DPVT’08, pages 137–144, 2008.
[19] J. Knopp, J. Šivic, and T. Pajdla. Location recognition using large vocabularies and fast spatial
matching. Research Report CTU–CMP–2009–01, CMP Prague, January 2009.
[20] H. Li and R. Hartley. A non-iterative method for correcting lens distortion from nine point
correspondences. In OMNIVIS’05, pages 1–8, 2005.
[21] X.W. Li, C.C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm. Modeling and recognition of
landmark image collections using iconic scene graphs. In ECCV’08, pages I:427–440, 2008.
[22] M.I.A. Lourakis and A.A. Argyros. The design and implementation of a generic sparse bundle
adjustment software package based on the Levenberg-Marquardt algorithm. Tech. Report 340,
Institute of Computer Science – FORTH, August 2004.
[23] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110,
November 2004.
[24] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR’07, pages 1–8, 2007.
[25] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally
stable extremal regions. IVC, 22(10):761–767, September 2004.
[26] B. Mičušík and J. Košecká. Piecewise planar city 3D modeling from street view panoramic
sequences. In CVPR’09, pages 2906–2912, 2009.
[27] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir,
and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, November
2005.
[28] D. Nistér. A minimal solution to the generalized 3-point pose problem. In CVPR’04, pages
I:560–567, 2004.
[29] D. Nistér and C. Engels. Estimating global uncertainty in epipolar geometry for vehicle-mounted cameras. In SPIE – Unmanned Systems Technology VIII, pages 62301L:1–12, 2006.
[30] D. Nistér, F. Kahl, and H. Stewénius. Structure from motion with missing data is NP-hard. In
ICCV’07, pages 1–7, 2007.
[31] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR’06, pages
II: 2161–2168, 2006.
[32] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the
spatial envelope. IJCV, 42(3):145–175, May 2001.
[33] Point Grey Research Inc. Ladybug 2 – http://www.ptgrey.com/products/
ladybug2/index.asp, 2005.
[34] D. Scaramuzza, F. Fraundorfer, R. Siegwart, and M. Pollefeys. Closing the loop in appearance
guided SfM for omnidirectional cameras. In OMNIVIS’08, pages 1–14, 2008.
[35] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or ’How
Do I Organize My Holiday Snaps?’. In ECCV’02, pages I:414–431, 2002.
[36] J. Šivic and A. Zisserman. Video Google: Efficient visual search of videos. In Toward
Category-Level Object Recognition (CLOR), pages 127–144, 2006.
[37] N. Snavely, S. Seitz, and R. Szeliski. Modeling the world from internet photo collections.
IJCV, 80(2):189–210, 2008.
[38] N. Snavely, S. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In
CVPR’08, pages 1–8, 2008.
[39] J.P. Tardif, Y. Pavlidis, and K. Daniilidis. Monocular visual odometry in urban environments
using an omdirectional camera. In IROS’08, pages 2531–2538, 2008.
[40] A. Torii, M. Havlena, and T. Pajdla. Omnidirectional image stabilization by computing camera trajectory. In PSIVT’09, pages 71–82, 2009.
[41] A. Torii and T. Pajdla. Omnidirectional camera motion estimation. In VISAPP’08, pages
II:577–584, 2008.
[42] M. Vergauwen and L. Van Gool. Web-based 3D reconstruction service. Machine Vision and
Applications (MVA), 17(6):411–426, December 2006.
Resumé

The thesis deals with Structure from Motion (SfM) computation from large ordered, i.e. sequential, and unordered image sets. We propose to replace the most time-consuming part of the computation from unordered data, the search for visual correspondences between all pairs of images, by a fast estimate of the visual overlap of two images based on the detected occurrences of visual words. Searching for correspondences only between pairs of images with a sufficient estimated overlap leads to a significant speedup of the computation. The efficiency of reconstruction from redundant image sets, e.g. from images of city landmarks downloaded from the Internet, can be further improved by the proposed reduction of the input image set size using a fast graph algorithm.

The efficiency of the computation also suffers when the method used spends too much time solving a few difficult reconstruction subtasks instead of searching for other, often easier, ways of computation. We propose to use a priority queue for interleaving the individual subtasks, which makes it possible to obtain reasonable 3D models in limited time. The priorities of the individual tasks are set with respect to the estimated visual overlaps but are also influenced by the history of the computation.

Visual overlaps estimated from co-occurring visual words proved useful also for processing ordered image sets. Geometrically verified returns to previously visited parts of the scene, so-called visual loops, are added to the constructed 3D model as new constraints for the non-linear bundle adjustment method, which closes the detected loops by enforcing global consistency of the 3D model over the whole sequence.

A number of technical improvements of the computation were proposed as well: (i) triplets of images are used instead of pairs as the seeds of reconstructions because 3D points verified in three views are rarely incorrect; (ii) the amount of camera translation with respect to the scene, an indicator of the reliability of the relative camera translation computation, is measured by the dominant apical angle. By finding image pairs with sufficiently large apical angles, we select keyframes from input sequences or high-quality reconstruction seeds from unordered image sets; (iii) a cone intersection test is used instead of the traditionally used reprojection error for verifying 2D-3D correspondences in RANSAC. This way, correct correspondences can be accepted even if the currently estimated 3D point positions are incorrect.

The functionality of the proposed methods is demonstrated by a number of experiments on ordered and unordered image sets consisting of thousands of images. Models of city parts are constructed both from images acquired with a fish-eye lens and from omnidirectional panoramas. Images generated by the proposed non-central cylindrical projection from sequences stabilized with respect to the ground plane are successfully used for pedestrian detection.
Author’s Publications
Publications related to the thesis
Impacted journal articles
[1] Akihiko Torii, Michal Havlena, and Tomáš Pajdla. Omnidirectional image stabilization for visual object recognition. International Journal of Computer Vision,
91(2):157–174, January 2011. Authorship: 50-25-25.
Publications excerpted by WOS
[2] Michal Havlena, Akihiko Torii, and Tomáš Pajdla. Efficient structure from motion
by graph optimization. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios,
editors, Computer Vision - ECCV 2010, 11th European Conference on Computer
Vision, Proceedings, Part II, volume 6312 of Lecture Notes in Computer Science,
pages 100–113, Berlin, Germany, September 2010. Foundation for Research and
Technology-Hellas (FORTH), Springer-Verlag. Authorship: 34-33-33.
[3] Michal Havlena, Akihiko Torii, Jan Knopp, and Tomáš Pajdla. Randomized structure from motion based on atomic 3D models from camera triplets. In CVPR 2009:
Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2874–2881, Madison, USA, June 2009. IEEE
Computer Society, Omnipress. Authorship: 30-30-10-30.
[4] Akihiko Torii, Michal Havlena, and Tomáš Pajdla. Omnidirectional image stabilization by computing camera trajectory. In Toshikazu Wada, Fay Huang, and
Stephen Y. Lin, editors, PSIVT ’09: Advances in Image and Video Technology:
Third Pacific Rim Symposium, volume 5414 of Lecture Notes in Computer Science, pages 71–82, Berlin, Germany, January 2009. Springer Verlag. Authorship:
40-30-30.
[5] Akihiko Torii, Michal Havlena, Tomáš Pajdla, and Bastian Leibe. Measuring camera translation by the dominant apical angle. In CVPR 2008: Proceedings of the
2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, page 7, Madison, USA, June 2008. IEEE Computer Society, Omnipress.
Authorship: 30-30-30-10.
[6] Michal Havlena, Tomáš Pajdla, and Kurt Cornelis. Structure from omnidirectional
stereo rig motion for city modeling. In AlpeshKumar N. Ranchordas and Helder J.
Araújo, editors, VISAPP 2008: Proceedings of the Third International Conference
on Computer Vision Theory and Applications, volume 2, pages 407–414, Setúbal,
Portugal, January 2008. INSTICC Press. Authorship: 34-33-33.
Other conference/congress publications
[7] Michal Havlena, Michal Jančošek, Ben Huber, Frédéric Labrosse, Laurence Tyler,
Tomáš Pajdla, Gerhard Paar, and Dave Barnes. Digital elevation modeling from
Aerobot camera images. In EPSC-DPS 2011: European Planetary Science
Congress Abstracts, volume 6, page 2, Goettingen, Germany, October 2011. Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship: 13-13-13-13-12-12-12-12.
[8] Michal Havlena, Michal Jančošek, Jan Heller, and Tomáš Pajdla. 3D surface models from Opportunity MER NavCam. In EPSC 2010: European Planetary Science
Congress Abstracts, volume 5, page 2, Goettingen, Germany, September 2010.
Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship:
25-25-25-25.
[9] Akihiko Torii, Michal Havlena, and Tomas Pajdla. From Google street view to
3D city models. In OMNIVIS ’09: 9th IEEE Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras, page 8, Los Alamitos, USA,
October 2009. IEEE Computer Society Press. Authorship: 40-30-30.
[10] Michal Havlena, Akihiko Torii, Michal Jančošek, and Tomáš Pajdla. Automatic
reconstruction of Mars artifacts. In EPSC 2009: European Planetary Science
Congress Abstracts, volume 4, page 2, Goettingen, Germany, September 2009.
Europlanet Research Infrastructure, Copernicus Gesellschaft mbH. Authorship:
25-25-25-25.
[11] Michal Havlena, Akihiko Torii, and Tomáš Pajdla. Camera trajectory from wide
baseline images. In EPSC 2008: European Planetary Science Congress Abstracts,
volume 3, page 2, Goettingen, Germany, September 2008. Europlanet Research
Infrastructure, Copernicus Gesellschaft mbH. Authorship: 34-33-33.
[12] Michal Havlena. City modeling from omnidirectional video. In Bohuslav Říha,
editor, Proceedings of Workshop 2008, pages 114–115, Prague, Czech Republic,
February 2008. Czech Technical University in Prague. Authorship: 100.
[13] Michal Havlena, Kurt Cornelis, and Tomáš Pajdla. Towards city modeling from
omnidirectional video. In Michael Grabner and Helmut Grabner, editors, CVWW
2007: Proceedings of the 12th Computer Vision Winter Workshop, pages 123–130,
Graz, Austria, February 2007. Institute for Computer Graphics and Vision, Graz
University of Technology, Graz, Austria, Verlag der Technischen Universität Graz.
Authorship: 34-33-33.
Additional publications
Impacted journal articles
[14] Daphna Weinshall, Alon Zweig, Hynek Hermansky, Stefan Kombrink, Frank W.
Ohl, Jörn Anemüller, Jörg-Hendrik Bach, Luc Van Gool, Fabian Nater, Tomas
Pajdla, Michal Havlena, and Misha Pavel. Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PP(99), 2012, to appear. Authorship:
9-9-9-9-8-8-8-8-8-8-8-8.
Publications excerpted by WOS
[15] Jan Heller, Michal Havlena, and Tomáš Pajdla. A branch-and-bound algorithm for
globally optimal hand-eye calibration. In CVPR 2012: Proceedings of the 2012
IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
IEEE Computer Society, June 2012, to appear. Authorship: 34-33-33.
[16] Michal Havlena, Jan Heller, Hendrik Kayser, Jörg-Hendrik Bach, Jörn Anemüller,
and Tomáš Pajdla. Incongruence detection in audio-visual processing. In Daphna
Weinshall, Jörn Anemüller, and Luc Van Gool, editors, Detection and Identification of Rare Audiovisual Cues, volume 384 of Studies in Computational Intelligence, pages 67–75, Berlin, Germany, September 2012. Universitat Politècnica de
Catalunya – BarcelonaTech (UPC), Springer-Verlag. Authorship: 17-17-17-17-16-16.
[17] Tomáš Pajdla, Michal Havlena, and Jan Heller. Learning from incongruence.
In Daphna Weinshall, Jörn Anemüller, and Luc Van Gool, editors, Detection
and Identification of Rare Audiovisual Cues, volume 384 of Studies in Computational Intelligence, pages 119–127, Berlin, Germany, September 2012. Universitat
Politècnica de Catalunya – BarcelonaTech (UPC), Springer-Verlag. Authorship:
34-33-33.
[18] Jan Heller, Michal Havlena, Akihiro Sugimoto, and Tomáš Pajdla. Structure-from-motion based hand-eye calibration using L∞ minimization. In Pedro Felzenszwalb, David Forsyth, and Pascal Fua, editors, CVPR 2011: Proceedings of the
2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3497–3503, Los Alamitos, USA, June 2011. IEEE Computer Society,
IEEE Computer Society. Authorship: 25-25-25-25.
[19] Michal Havlena, Andreas Ess, Wim Moreau, Akihiko Torii, Michal Jančošek,
Tomáš Pajdla, and Luc Van Gool. AWEAR 2.0 system: Omni-directional audiovisual data acquisition and processing. In EGOVIS 2009: Proceedings of the First
Workshop on Egocentric Vision, pages 49–56, Madison, USA, June 2009. Omnipress. Authorship: 15-15-14-14-14-14-14.
Other conference/congress publications
[20] Jörn Anemüller, Jörg-Hendrik Bach, Barbara Caputo, Michal Havlena, Luo Jie,
Hendrik Kayser, Bastian Leibe, Petr Motlicek, Tomas Pajdla, Misha Pavel, Akihiko Torii, Luc Van Gool, Alon Zweig, and Hynek Hermansky. The DIRAC
AWEAR audio-visual platform for detection of unexpected and incongruent
events. In Vassilios Digalakis, Alexandros Potamianos, Matthew Turk, Roberto
Pieraccini, and Yuri Ivanov, editors, ICMI 2008: Proceedings of the 10th International Conference on Multimodal Interfaces, pages 289–292, New York, USA,
October 2008. Association for Computing Machinery, ACM. Authorship: 8-8-7-7-7-7-7-7-7-7-7-7-7-7.
Citations of Author’s Work
[1] Thomas Albrecht, Tele Tan, Geoff A. W. West, and Thanh Ly. Omnidirectional
video stabilisation on a virtual camera using sensor fusion. In ICARCV 2010: Proceedings of the 11th International Conference on Control, Automation, Robotics
and Vision, pages 2067–2072, 2010.
[2] Jean-Charles Bazin, Cedric Demonceaux, Pascal Vasseur, and Inso Kweon. Rotation estimation and vanishing point extraction by omnidirectional vision in urban
environment. International Journal of Robotics Research, 31(1):63–81, January
2012.
[3] Jing Chen and Baozong Yuan. Metric 3D reconstruction from uncalibrated unordered images with hierarchical merging. In B.Z. Yuan, Q.Q. Ruan, and X.F.
Tang, editors, ICSP 2010: Proceedings of the 2010 IEEE 10th International Conference on Signal Processing, Vols I-III, pages 1169–1172, 2010.
[4] Shengyong Chen, Yuehui Wang, and Carlo Cattani. Key issues in modeling of
complex 3D structures from video sequences. Mathematical Problems in Engineering, 2012.
[5] Jerome Courchay, Arnak Dalalyan, Renaud Keriven, and Peter Sturm. Exploiting
loops in the graph of trifocal tensors for calibrating a network of cameras. In
K. Daniilidis, P. Maragos, and N. Paragios, editors, Computer Vision - ECCV 2010,
11th European Conference on Computer Vision, Proceedings, Part II, volume 6312
of Lecture Notes in Computer Science, pages 85–99, 2010.
[6] Jerome Courchay, Arnak S. Dalalyan, Renaud Keriven, and Peter Sturm. On camera calibration with linear programming and loop constraint linearization. International Journal of Computer Vision, 97(1):71–90, March 2012.
[7] Jean-Lou De Carufel and Robert Laganiere. Matching cylindrical panorama sequences using planar reprojections. In OMNIVIS ’11: 11th IEEE Workshop on
Omnidirectional Vision, Camera Networks and Non-classical Cameras, 2011.
[8] Weijia Feng, Juha Roning, Juho Kannala, Xiaoning Zong, and Baofeng Zhang. A
general model and calibration method for spherical stereoscopic vision. In J. Roning and D.P. Casasent, editors, Intelligent Robots and Computer Vision XXIX: Algorithms and Techniques, volume 8301 of Proceedings of SPIE, 2012.
[9] Danilo Hollosi, Stefan Wabnik, Stephan Gerlach, and Steffen Kortlang. Catalog of
basic scenes for rare/incongruent event detection. In D. Weinshall, J. Anemuller,
and L. Van Gool, editors, Detection and Identification of Rare Audiovisual Cues,
volume 384 of Studies in Computational Intelligence, pages 77–84. SpringerVerlag, 2012.
[10] Maxime Lhuillier. A generic error model and its application to automatic 3D modeling of scenes using a catadioptric camera. International Journal of Computer
Vision, 91(2):175–199, January 2011.
[11] Christopher Rasmussen, Yan Lu, and Mehmet Kocamaz. Trail following with
omnidirectional vision. In IROS 2010: Proceedings of the IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, 2010.
[12] Richard Roberts, Sudipta N. Sinha, Richard Szeliski, and Drew Steedly. Structure from motion for scenes with large duplicate structures. In CVPR 2011: Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2011.
[13] Wolfgang Stuerzl, Darius Burschka, and Michael Suppa. Monocular ego-motion
estimation with a compact omnidirectional camera. In IROS 2010: Proceedings of
the IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems,
pages 822–828, 2010.
[14] Yuan-Kai Wang, Ching-Tang Fan, Shao-Ang Chen, and Hou-Ye Chen. X-Eye:
A novel wearable vision system. In N. Kehtarnavaz and M.F. Carlsohn, editors,
Real-time Image and Video Processing 2011, volume 7871 of Proceedings of SPIE,
2011.
[15] Marc Wieland, Massimiliano Pittore, Stefano Parolai, Jochen Zschau, Bolot
Moldobekov, and Ulugbek T. Begaliev. Estimating building inventory for rapid
seismic vulnerability assessment: Towards an integrated approach based on multisource imaging. Soil Dynamics and Earthquake Engineering, 36:70–83, May
2012.
[16] Christopher Zach, Manfred Klopschitz, and Marc Pollefeys. Disambiguating visual relations using loop constraints. In CVPR 2010: Proceedings of the 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
pages 1426–1433, 2010.