Geometry 4: Multiview Stereo
Introduction to Computer Vision
Ronen Basri
Weizmann Institute of Science

Material covered
• Pinhole camera model, perspective projection
• Two view geometry, general case:
  • Epipolar geometry, the essential matrix
  • Camera calibration, the fundamental matrix
• Two view geometry, degenerate cases
  • Homography (planes, camera rotation)
• A taste of projective geometry
• Stereo vision: 3D reconstruction from two views
• Multi-view geometry, reconstruction through factorization

Structure from motion
• Input:
  • a set of point tracks
• Output:
  • 3D location of each point (shape)
  • camera parameters (motion)
• Assumptions:
  • Rigid motion
  • Orthographic projection (no scale)
• Method: SVD factorization (Tomasi & Kanade)

Setup
• $I_1, I_2, \dots, I_m$: a collection of images (video frames) depicting a rigid scene
• $n$ point tracks in those $m$ frames: $p_{ij} = (x_{ij}, y_{ij})^T$ is the location of $P_i$ at frame $j$
• Unknown 3D locations: $P_i = (X_i, Y_i, Z_i)^T \in \mathbb{R}^3$, $i = 1, \dots, n$
• Therefore,
  $x_{ij} = r_j^T P_i + a_j$
  $y_{ij} = s_j^T P_i + b_j$
  where $r_j^T, s_j^T$ are the two top rows of a $3 \times 3$ rotation matrix

Objective
Find $r_j, s_j, P_i \in \mathbb{R}^3$ and $a_j, b_j \in \mathbb{R}$ that minimize
$$\sum_{i=1}^{n} \sum_{j=1}^{m} \left[ \left( (r_j^T P_i + a_j) - x_{ij} \right)^2 + \left( (s_j^T P_i + b_j) - y_{ij} \right)^2 \right]$$
subject to $\|r_j\| = \|s_j\| = 1$, $r_j^T s_j = 0$

Eliminate translation
• We can eliminate translation by representing the location of each point relative to the centroid of all $n$ points
• Assume without loss of generality that the centroid of $P_1, \dots, P_n$ coincides with the origin $0 \in \mathbb{R}^3$
• Translate each image point by setting
  $x_{ij} \leftarrow x_{ij} - \bar{x}_j$
  $y_{ij} \leftarrow y_{ij} - \bar{y}_j$
  where $(\bar{x}_j, \bar{y}_j)$ denotes the centroid of the points $(x_{ij}, y_{ij})$ in frame $j$
• Since the centroid of the $P_i$ is at the origin, averaging $x_{ij} = r_j^T P_i + a_j$ over $i$ gives $\bar{x}_j = a_j$ (likewise $\bar{y}_j = b_j$), so the centered coordinates satisfy $x_{ij} = r_j^T P_i$ and $y_{ij} = s_j^T P_i$

Objective (no translation)
Find $r_j, s_j, P_i \in \mathbb{R}^3$ that minimize
$$\sum_{i=1}^{n} \sum_{j=1}^{m} \left[ \left( r_j^T P_i - x_{ij} \right)^2 + \left( s_j^T P_i - y_{ij} \right)^2 \right]$$
subject to $\|r_j\| = \|s_j\| = 1$, $r_j^T s_j = 0$

Measurement matrix
$$W = \begin{pmatrix} x_{11} & \dots & x_{n1} \\ \vdots & & \vdots \\ x_{1m} & \dots & x_{nm} \\ y_{11} & \dots & y_{n1} \\ \vdots & & \vdots \\ y_{1m} & \dots & y_{nm} \end{pmatrix}_{2m \times n}$$
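The centering step and the measurement matrix above can be sketched numerically. This is a minimal illustration under stated assumptions, not the lecture's own code: the helper names `random_rotation` and `build_measurement_matrix`, the sizes (5 frames, 20 points), and the random seed are all hypothetical, and the tracks are synthetic, noise-free orthographic projections. The sketch stacks the centered $x$ rows above the centered $y$ rows and then inspects the singular values of $W$, which should show that only three are significantly non-zero.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3x3 rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # normalize the column signs
    if np.linalg.det(q) < 0:         # force det = +1 (a proper rotation)
        q[:, 0] = -q[:, 0]
    return q

def build_measurement_matrix(x, y):
    """Center the tracks per frame and stack them into the 2m x n matrix W.

    x, y: arrays of shape (m, n); entry [j, i] is the coordinate of
    point i in frame j.
    """
    x_c = x - x.mean(axis=1, keepdims=True)   # subtract the per-frame centroid
    y_c = y - y.mean(axis=1, keepdims=True)
    return np.vstack([x_c, y_c])              # x rows first, then y rows

rng = np.random.default_rng(0)
m, n = 5, 20                                  # frames, points (hypothetical sizes)
P = rng.standard_normal((3, n))               # 3D points, one per column

x = np.empty((m, n))
y = np.empty((m, n))
for j in range(m):
    R = random_rotation(rng)
    a, b = rng.standard_normal(2)             # per-frame image translation
    x[j] = R[0] @ P + a                       # x_ij = r_j^T P_i + a_j
    y[j] = R[1] @ P + b                       # y_ij = s_j^T P_i + b_j

W = build_measurement_matrix(x, y)
singular_values = np.linalg.svd(W, compute_uv=False)
print(W.shape)
print(singular_values)
```

With noise-free data the fourth and later singular values are at the level of floating-point round-off; with noisy tracks they would be small but non-zero.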
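Transformation and shape matrices
$$M = \begin{pmatrix} r_1^T \\ \vdots \\ r_m^T \\ s_1^T \\ \vdots \\ s_m^T \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ \vdots & \vdots & \vdots \\ r_{m1} & r_{m2} & r_{m3} \\ s_{11} & s_{12} & s_{13} \\ \vdots & \vdots & \vdots \\ s_{m1} & s_{m2} & s_{m3} \end{pmatrix}_{2m \times 3}$$
$$S = \begin{pmatrix} P_1 & P_2 & \dots & P_n \end{pmatrix} = \begin{pmatrix} X_1 & X_2 & \dots & X_n \\ Y_1 & Y_2 & \dots & Y_n \\ Z_1 & Z_2 & \dots & Z_n \end{pmatrix}_{3 \times n}$$

Objective: matrix notation
Find $M$ and $S$ that minimize $\|W - MS\|_F$
subject to $\|r_j\| = \|s_j\| = 1$, $r_j^T s_j = 0$
$W$ is $2m \times n$, $M$ is $2m \times 3$, $S$ is $3 \times n$

$W = MS + \text{Noise}$
$$\underbrace{\begin{pmatrix} x_{11} & \dots & x_{n1} \\ \vdots & & \vdots \\ x_{1m} & \dots & x_{nm} \\ y_{11} & \dots & y_{n1} \\ \vdots & & \vdots \\ y_{1m} & \dots & y_{nm} \end{pmatrix}}_{2m \times n} = \underbrace{\begin{pmatrix} r_{11} & r_{12} & r_{13} \\ \vdots & \vdots & \vdots \\ r_{m1} & r_{m2} & r_{m3} \\ s_{11} & s_{12} & s_{13} \\ \vdots & \vdots & \vdots \\ s_{m1} & s_{m2} & s_{m3} \end{pmatrix}}_{2m \times 3} \underbrace{\begin{pmatrix} X_1 & \dots & X_n \\ Y_1 & \dots & Y_n \\ Z_1 & \dots & Z_n \end{pmatrix}}_{3 \times n} + \text{Noise}$$

TK-Factorization
$W = MS + \text{Noise}$
Step 1: find a rank 3 approximation to $W$ using SVD
$W = U \Sigma V^T$, where
• $U$ is $2m \times 2m$, $U^T U = I$
• $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots)$ has size $2m \times n$, and $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$
• $V$ is $n \times n$, $V^T V = I$

TK-Factorization
$\hat{W} = U \Sigma_3 V^T$, where $\Sigma_3 = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3, 0, 0, \dots)$
Note: this is a relaxation; only the noise components outside the 3D space are annihilated
Step 2: factorization
$\hat{M} = U \sqrt{\Sigma_3}$, $\hat{S} = \sqrt{\Sigma_3} V^T$, keeping only the first three columns of $U$ and $V$ and the leading $3 \times 3$ block of $\Sigma_3$, so that $\hat{M}$ is $2m \times 3$ and $\hat{S}$ is $3 \times n$
Ambiguity: $\hat{W} = \hat{M} \hat{S} = (\hat{M} A)(A^{-1} \hat{S})$ for any non-singular $3 \times 3$ matrix $A$

TK-Factorization
Step 3: resolve the ambiguity using the constraints $\|r_j\| = \|s_j\| = 1$, $r_j^T s_j = 0$
• Let $R_j = \begin{pmatrix} r_j^T \\ s_j^T \end{pmatrix}_{2 \times 3}$; note that $R_j R_j^T = I$
• Let $\hat{R}_j = \begin{pmatrix} \hat{r}_j^T \\ \hat{s}_j^T \end{pmatrix}_{2 \times 3}$ be the corresponding rows in $\hat{M}$; then $R_j = \hat{R}_j A$
• Find a $3 \times 3$ symmetric matrix $A A^T$ such that
  $\hat{R}_j A A^T \hat{R}_j^T = R_j R_j^T = I$

TK-Factorization
$\hat{R}_j A A^T \hat{R}_j^T = R_j R_j^T = I$
• The equation is linear in $A A^T$
• There are $3m$ equations in 6 unknowns
• Find $A$ by eigen-decomposition: $A A^T = Q \Lambda Q^T$, so that $A = Q \sqrt{\Lambda}$
• The solution is obtained up to a rotation ambiguity: $\hat{R}_j (AB)(B^T A^T) \hat{R}_j^T = I$ for any $B$ such that $B B^T = I$

TK-Factorization: Summary
1. Eliminate translation, construct $W$
2. $\mathrm{SVD}(W)$ to get the rank 3 $\hat{W}$ and factorize $\hat{W} = \hat{M} \hat{S}$ ($3 \times 3$ ambiguity $A$ remains)
3. Resolve ambiguity: estimate $A A^T$ by exploiting orthonormality of each rotation, then factorize to obtain $A$
Final solution up to rotation and reflection

TK-Factorization: pros and cons
• Advantages:
  • Breaks a difficult, non-linear optimization into simple optimization steps
  • Works well with noisy measurements
• Disadvantages:
  • Orthographic projection
  • Requires complete tracks

Factorization with incomplete tracks
• Need a way to approximate $W$ by a low rank matrix with missing data:
  $\min_{\mathrm{rank}(\hat{W}) = 3} \| \Omega \odot (W - \hat{W}) \|_F$
  where $\Omega$ is a mask, $\Omega_{ij} = 1$ wherever $W_{ij}$ is known and $0$ elsewhere
• This problem is NP-hard
• Surrogate: minimize the nuclear norm, the sum of singular values $\sigma_1 + \sigma_2 + \sigma_3 + \cdots$
• The nuclear norm is convex; its minimization often achieves low rank
• Better iterative procedures exist

Perspective multiview stereo
• A point $P = (X, Y, Z)^T$ is projected to
  $x = \frac{fX}{Z}$, $y = \frac{fY}{Z}$
• A point rotated by $R$ and translated by $t = (t_x, t_y, t_z)^T$ projects to
  $x = \frac{f (r_1^T P + t_x)}{r_3^T P + t_z}$, $y = \frac{f (r_2^T P + t_y)}{r_3^T P + t_z}$
  where $r_k^T$ denotes the rows of $R$

Bundle adjustment
• Given $n$ points in $m$ frames, $(x_{ij}, y_{ij})$, find camera matrices $C_j$ and positions $P_i$ that minimize
  $$\sum_{i=1}^{n} \sum_{j=1}^{m} \left[ \left( \frac{f (r_{j1}^T P_i + t_{jx})}{r_{j3}^T P_i + t_{jz}} - x_{ij} \right)^2 + \left( \frac{f (r_{j2}^T P_i + t_{jy})}{r_{j3}^T P_i + t_{jz}} - y_{ij} \right)^2 \right]$$
  where $r_{jk}^T$ is the $k$-th row of $R_j$ and $t_j = (t_{jx}, t_{jy}, t_{jz})^T$
• Alternate optimization
  • Given $R_j$ and $t_j$, solve for $P_i$
  • Given $P_i$, solve for $R_j$ and $t_j$
• A very good initial guess is required

Bundler (Photo Tourism)
(Snavely et al.)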
• Given images, identify feature points and describe them with SIFT descriptors
• Match SIFT descriptors; accept each match $p_i \leftrightarrow p_j$ whose score is at least twice that of any other match $p_i \leftrightarrow p_k$
• For every pair of images with sufficiently many matches, use RANSAC to recover essential matrices
• Starting with two images and adding one image at a time: use the essential matrix to recover depth and apply bundle adjustment

Simultaneous solutions
• $E_{ij} = [t_{ij}]_\times R_{ij}$: essential matrix between $I_i$ and $I_j$, $i, j = 1, \dots, m$, available on a subset of image pairs
• Objective: recover camera orientation $R_i$ and location $t_i$ relative to a global coordinate system
• First step: recover rotations:
  $\min_{R_1, \dots, R_m} \sum_{i,j} \| R_{ij} - R_i^T R_j \|_F^2$
• This can be solved in various ways, for example
  $\min \sum_{i,j} \| R_i R_{ij} - R_j \|_F^2$: a least squares solution if we ignore the orthonormality constraints for $R_i$

Epipolar relation in global coordinates
• The epipolar line relation, $p_i^T E_{ij} p_j = 0$, can be written in a global coordinate system as follows:
  $p_i^T R_i^T \left( [t_i]_\times - [t_j]_\times \right) R_j \, p_j = 0$
• This generalizes the formula for the essential matrix (plug in $R_i = I$, $t_i = 0$)
• Once the camera orientations $R_i$ are known we can solve for the camera locations (the equation is linear and homogeneous in the translation components)
• The solution suffers from shrinkage problems

Multiview reconstruction
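As a numerical sanity check of the epipolar relation in global coordinates stated above, the following minimal sketch builds two synthetic cameras and verifies that the relation vanishes on corresponding points. Several things here are assumptions rather than part of the slides: the camera convention (camera $k$ sees a world point $P$ at $R_k^T (P - t_k)$), the helper names `skew` and `random_rotation`, the seed, and the number of points. Under that convention the pairwise quantities are $R_{ij} = R_i^T R_j$ and $t_{ij} = R_i^T (t_j - t_i)$, and the pairwise essential matrix equals the global form up to sign, which does not affect a homogeneous constraint.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def random_rotation(rng):
    """Draw a random 3x3 rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # normalize the column signs
    if np.linalg.det(q) < 0:         # force det = +1 (a proper rotation)
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(1)
R_i, R_j = random_rotation(rng), random_rotation(rng)
t_i, t_j = rng.standard_normal(3), rng.standard_normal(3)

# Synthetic world points, placed away from both camera centers.
P = rng.standard_normal((3, 50)) + np.array([[0.0], [0.0], [10.0]])

# Assumed convention: camera k sees world point P at R_k^T (P - t_k).
P_i = R_i.T @ (P - t_i[:, None])
P_j = R_j.T @ (P - t_j[:, None])

# Homogeneous image points; the epipolar relation is scale-invariant,
# so each point is simply scaled to unit length.
p_i = P_i / np.linalg.norm(P_i, axis=0)
p_j = P_j / np.linalg.norm(P_j, axis=0)

# Pairwise essential matrix implied by the two global poses.
R_ij = R_i.T @ R_j
t_ij = R_i.T @ (t_j - t_i)
E_pair = skew(t_ij) @ R_ij

# Global form of the same relation.
E_global = R_i.T @ (skew(t_i) - skew(t_j)) @ R_j

# One residual p_i^T E p_j per correspondence.
residuals = np.einsum('ki,kl,li->i', p_i, E_global, p_j)
print(np.max(np.abs(residuals)))
```

Because the global form is linear in $t_i$ and $t_j$ once the rotations are fixed, stacking one such equation per correspondence gives the homogeneous linear system for the camera locations mentioned above.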