Structure-from-Motion
• Determining the 3-D structure of the world, and/or the motion of a camera, using a sequence of images taken by a moving camera.
  – Equivalently, we can think of the world as moving and the camera as fixed.
• Like stereo, but the position of the camera isn't known (and it's more natural to use many images with little motion between them, not just two with a lot of motion).
  – We may or may not assume we know the parameters of the camera, such as its focal length.

Structure-from-Motion
• As with stereo, we can divide the problem into two parts:
  – Correspondence.
  – Reconstruction.
• Again, we'll talk about reconstruction first.
  – So for the next few classes we assume that each image contains some points, and we know which points match which.

Structure-from-Motion
[Figure: a movie (sequence of frames).]

Reconstruction
• A lot harder than with stereo.
• Start with a simpler case: scaled orthographic projection (weak perspective).
  – Recall that in this model we drop the z coordinate and scale all x and y coordinates by the same amount.

Announcements
• Quiz on April 22 in class (a week from Thursday).
  – Practice problems handed out Thursday.
  – Reviews at office hours, possibly in seminar room across the hall from my office.
• Questions about PS6

First: Represent motion
• We'll talk about a fixed camera and a moving object.
• Key point: given the points

$$ P = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}, $$

some matrix

$$ S = \begin{pmatrix} s_{1,1} & s_{1,2} & s_{1,3} & t_x \\ s_{2,1} & s_{2,2} & s_{2,3} & t_y \end{pmatrix}, $$

and the image

$$ I = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \\ v_1 & v_2 & \cdots & v_n \end{pmatrix}, $$

then: $I = SP$.

Structure-from-Motion
• S encodes:
  – Projection: only two rows.
  – Scaling, since S can have a scale factor s.
  – Translation, by $t_x/s$ and $t_y/s$.
  – Rotation.

$$ I = SP $$

Rotation

$$ \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{3,1} & r_{3,2} & r_{3,3} \end{pmatrix} P $$

Represents a 3D rotation of the points in P (here P is taken without its row of ones).

First, look at 2D rotation (easier)
Matrix R acts on points by rotating them.

$$ R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \qquad R \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{pmatrix} $$

(The signs are written for a counterclockwise rotation by $\theta$.)
• Also, $RR^T$ = Identity. $R^T$ is also a rotation matrix, in the opposite direction to R.
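The two properties of a 2D rotation stated above can be checked numerically. The following is a minimal sketch using NumPy; the helper name `rotation_2d`, the test angle, and the sample points are illustrative assumptions, and the counterclockwise sign convention is one common choice rather than something the slides fix.

```python
import numpy as np

def rotation_2d(theta):
    """Return the 2x2 matrix that rotates points counterclockwise by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.pi / 6                      # illustrative angle (30 degrees)
R = rotation_2d(theta)

# Points stored as columns, matching the 2 x n layout used in the slides.
points = np.array([[1.0, 0.0, 2.0],    # x coordinates
                   [0.0, 1.0, 1.0]])   # y coordinates

rotated = R @ points                   # rotate every point at once
restored = R.T @ rotated               # R^T undoes the rotation

print(np.allclose(R @ R.T, np.eye(2)))         # R R^T is the identity
print(np.allclose(restored, points))           # R^T rotates in the opposite direction
print(np.allclose(R.T, rotation_2d(-theta)))   # R^T is the rotation by -theta
```

Storing the points as columns means one matrix product rotates all of them, which is the same layout the later slides use for P.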
Simple 3D Rotation

$$ \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{pmatrix} $$

Rotation about z axis. Rotates x, y coordinates. Leaves z coordinates fixed.

Full 3D Rotation

$$ R = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{pmatrix} $$

• Any rotation can be expressed as a combination of three rotations about the three axes.

$$ RR^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

• Rows (and columns) of R are orthonormal vectors.
• R has determinant 1 (not -1).

Questions?

Putting it Together

$$ I = s \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} & 0 \\ r_{2,1} & r_{2,2} & r_{2,3} & 0 \\ r_{3,1} & r_{3,2} & r_{3,3} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} P $$

(Factors from left to right: scale, projection, 3D translation, 3D rotation.)

$$ I = \begin{pmatrix} s_{1,1} & s_{1,2} & s_{1,3} & s t_x \\ s_{2,1} & s_{2,2} & s_{2,3} & s t_y \end{pmatrix} P $$

where

$$ (s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0, \qquad (s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{1,1}, s_{1,2}, s_{1,3}) = (s_{2,1}, s_{2,2}, s_{2,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}). $$

We can just write $s t_x$ as $t_x$ and $s t_y$ as $t_y$.

Affine Structure from Motion

$$ I = \begin{pmatrix} s_{1,1} & s_{1,2} & s_{1,3} & t_x \\ s_{2,1} & s_{2,2} & s_{2,3} & t_y \end{pmatrix} P $$

In the affine setting the two constraints above (orthogonal rows of equal length) are not enforced: the entries of S are treated as free.

Affine Structure-from-Motion: Two Frames (1)

$$ \begin{pmatrix} u^1_1 & u^1_2 & \cdots & u^1_n \\ v^1_1 & v^1_2 & \cdots & v^1_n \\ u^2_1 & u^2_2 & \cdots & u^2_n \\ v^2_1 & v^2_2 & \cdots & v^2_n \end{pmatrix} = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix} $$

Superscripts index the frame and subscripts index the point. Below, this equation is abbreviated as $I = SP$.

Affine Structure-from-Motion: Two Frames (2)
To make things easy, suppose:

$$ \begin{pmatrix} x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ z_1 & z_2 & z_3 & z_4 \\ 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} $$

Affine Structure-from-Motion: Two Frames (3)

$$ I = S \begin{pmatrix} 0 & 1 & 0 & 0 & x_5 & \cdots & x_n \\ 0 & 0 & 1 & 0 & y_5 & \cdots & y_n \\ 0 & 0 & 0 & 1 & z_5 & \cdots & z_n \\ 1 & 1 & 1 & 1 & 1 & \cdots & 1 \end{pmatrix} $$

Looking at the first four points, we get:

$$ \begin{pmatrix} u^1_1 & u^1_2 & u^1_3 & u^1_4 \\ v^1_1 & v^1_2 & v^1_3 & v^1_4 \\ u^2_1 & u^2_2 & u^2_3 & u^2_4 \\ v^2_1 & v^2_2 & v^2_3 & v^2_4 \end{pmatrix} = S \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} $$

The first column on the left is the translation column of S; subtracting it from columns 2, 3, and 4 gives the first three columns of S. So the images of these four points determine the motion.

Affine Structure-from-Motion: Two Frames (5)

$$ \begin{pmatrix} u^1_k \\ v^1_k \\ u^2_k \\ v^2_k \end{pmatrix} = S \begin{pmatrix} x_k \\ y_k \\ z_k \\ 1 \end{pmatrix} $$

Once we know the motion, we can use the images of another point to solve for the structure. We have four linear equations, with three unknowns $(x_k, y_k, z_k)$.

Affine Structure-from-Motion: Two Frames (7)
But, what if the first four points aren't so simple? Then we define A so that:

$$ A \begin{pmatrix} x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ z_1 & z_2 & z_3 & z_4 \\ 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} $$

This is always possible as long as the points aren't coplanar.

Affine Structure-from-Motion: Two Frames (8)
Then, given:

$$ I = SP $$

We have:

$$ I = S A^{-1} A P $$

And:

$$ I = S A^{-1} P', \qquad P' = AP = \begin{pmatrix} 0 & 1 & 0 & 0 & x'_5 & \cdots & x'_n \\ 0 & 0 & 1 & 0 & y'_5 & \cdots & y'_n \\ 0 & 0 & 0 & 1 & z'_5 & \cdots & z'_n \\ 1 & 1 & 1 & 1 & 1 & \cdots & 1 \end{pmatrix} $$

where $(x'_k, y'_k, z'_k)$ are the coordinates of point k after applying A.

Affine Structure-from-Motion: Two Frames (9)
Given:

$$ I = (S A^{-1}) P' $$

Then we just pretend that $S A^{-1}$ is our motion, and solve as before.

Affine Structure-from-Motion: Two Frames (10)
This means that we can never determine the exact 3D structure of the scene. We can only determine it up to some transformation, A. If a structure and motion explain the points:

$$ I = SP $$

so does another of the form:

$$ I = (S A^{-1}) (A P) $$

Affine Structure-from-Motion: Two Frames (11)

$$ I = S A^{-1} A P $$

Note that A has the form:

$$ A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

A corresponds to translation of the points, plus a linear transformation.

Let's take an explicit example of this. Suppose we have a cube with vertices at: (0,0,0), (10,0,0), (0,0,10), …. Suppose we transform this by rotating it by 45 degrees about the y axis. Then the transformation matrix is: [sq2 0 -sq2 0; 0 1 0 0], where sq2 = √2/2 (= cos 45° = sin 45°).
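The ambiguity just described can be checked numerically. The sketch below projects the cube with the matrix given above, then shows that an affinely distorted copy of the cube, viewed with the compensated motion S·inv(A), produces exactly the same image. The particular matrix `A` and the helper names are hypothetical, chosen only for illustration; any invertible A with last row (0, 0, 0, 1) would do.

```python
import numpy as np

def to_homogeneous(points):
    """Stack (n, 3) points as columns of a 4 x n matrix with a final row of ones."""
    return np.vstack([points.T, np.ones(len(points))])

# The 8 vertices of a cube of side 10 with one corner at the origin.
cube = 10.0 * np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)])

c = np.sqrt(2) / 2                     # "sq2" in the text
S = np.array([[c,   0.0, -c,  0.0],    # 2x4 motion: 45 degree rotation about y,
              [0.0, 1.0, 0.0, 0.0]])   # then orthographic projection

# Hypothetical affine distortion: linear part plus translation, last row (0, 0, 0, 1).
A = np.array([[2.0, 0.5, 0.0,  3.0],
              [0.0, 1.0, 0.3, -1.0],
              [0.1, 0.0, 1.5,  4.0],
              [0.0, 0.0, 0.0,  1.0]])

P = to_homogeneous(cube)               # true structure, 4 x 8
P_alt = A @ P                          # distorted structure (no longer a cube)
S_alt = S @ np.linalg.inv(A)           # compensated motion

image = S @ P
image_alt = S_alt @ P_alt

print(np.allclose(image, image_alt))   # the two explanations give the same image
```

Because the images agree for every point, no amount of image data from affine cameras can distinguish P from A·P.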
Now suppose instead we had a rectangular box with corners at (10,0,0), (11,0,0), (10,0,10), …. We can transform this box into the first cube by applying: [10 0 0 -100; 0 1 0 0; 0 0 1 0; 0 0 0 1]. Then we can apply [sq2 0 -sq2 0; 0 1 0 0] (with sq2 = √2/2) to the resulting cube, to generate the same image. Or, we could have combined these two transformations into one: [sq2 0 -sq2 0; 0 1 0 0] * [10 0 0 -100; 0 1 0 0; 0 0 1 0; 0 0 0 1] = [10sq2 0 -sq2 -100sq2; 0 1 0 0].

Affine Structure-from-Motion: A lot of frames (1)

$$ I = \begin{pmatrix} u^1_1 & u^1_2 & \cdots & u^1_n \\ v^1_1 & v^1_2 & \cdots & v^1_n \\ \vdots & & & \vdots \\ u^m_1 & u^m_2 & \cdots & u^m_n \\ v^m_1 & v^m_2 & \cdots & v^m_n \end{pmatrix}, \quad S = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ \vdots & & & \vdots \\ s^m_{1,1} & s^m_{1,2} & s^m_{1,3} & t^m_x \\ s^m_{2,1} & s^m_{2,2} & s^m_{2,3} & t^m_y \end{pmatrix}, \quad P = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix} $$

$$ I = SP $$

First Step: Solve for Translation (1)
• This is trivial, because we can pick a simple origin.
  – World origin is arbitrary.
  – Example: We can assume first point is at origin.
    • Rotation then doesn't affect that point.
    • All its motion is translation.
  – Better to pick center of mass as origin.
    • Average of all points.
    • This also averages all noise.

More explicitly, suppose sum(P) = (0,0,0,n)^T, where the sum runs over the columns of P. Then sum(R*P) = R*sum(P) = R*(0,0,0,n)^T = (0,0,0,n)^T, and sum(T*R*P) = T*(0,0,0,n)^T = (n tx, n ty, n tz, n)^T. (Or just look at the 2x4 projection matrix.) If we subtract tx from every entry of a u row and ty from every entry of the matching v row, then the residual is (s11,s12,s13; s21,s22,s23)*P, with P taken as the 3 x n matrix of x, y, z coordinates. I = s part of matrix + t part of matrix.

$$ I = s \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} & 0 \\ r_{2,1} & r_{2,2} & r_{2,3} & 0 \\ r_{3,1} & r_{3,2} & r_{3,3} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} P $$

First Step: Solve for Translation (2)
WLOG:

$$ \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} z_i = 0 $$

Define the row means

$$ \bar{u}^k = \frac{1}{n} \sum_{i=1}^{n} u^k_i, \qquad \bar{v}^k = \frac{1}{n} \sum_{i=1}^{n} v^k_i $$

Since the centroid of the points is at the origin, $\bar{u}^k = t^k_x$ and $\bar{v}^k = t^k_y$. Subtracting the means gives:

$$ \tilde{I} = \begin{pmatrix} u^1_1 - \bar{u}^1 & \cdots & u^1_n - \bar{u}^1 \\ v^1_1 - \bar{v}^1 & \cdots & v^1_n - \bar{v}^1 \\ \vdots & & \vdots \\ u^m_1 - \bar{u}^m & \cdots & u^m_n - \bar{u}^m \\ v^m_1 - \bar{v}^m & \cdots & v^m_n - \bar{v}^m \end{pmatrix} $$

First Step: Solve for Translation (3)

$$ \tilde{I} = \begin{pmatrix} \tilde{u}^1_1 & \cdots & \tilde{u}^1_n \\ \tilde{v}^1_1 & \cdots & \tilde{v}^1_n \\ \vdots & & \vdots \\ \tilde{u}^m_1 & \cdots & \tilde{u}^m_n \\ \tilde{v}^m_1 & \cdots & \tilde{v}^m_n \end{pmatrix} = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\ \vdots & & \vdots \\ s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\ s^m_{2,1} & s^m_{2,2} & s^m_{2,3} \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \\ y_1 & \cdots & y_n \\ z_1 & \cdots & z_n \end{pmatrix} $$

As if by magic, there's no translation.

Rank Theorem

$$ \tilde{I} = SP $$

where S is now the 2m x 3 matrix and P the 3 x n matrix from the previous slide.
$\tilde{I}$ has rank 3 (more precisely, at most 3, since it is the product of a 2m x 3 matrix and a 3 x n matrix). This means there are 3 vectors such that every row of $\tilde{I}$ is a linear combination of these vectors. These vectors are the rows of P.

Solve for S
• SVD is made to do this.

$$ \tilde{I} = UDV $$

D is diagonal with non-increasing values. U and V are orthogonal matrices (orthonormal rows and columns). Ignoring values that get set to 0, we have U(:,1:3) for S, and D(1:3,1:3)*V(1:3,:) for P.

Linear Ambiguity (as before)
$\tilde{I}$ = U(:,1:3) * D(1:3,1:3) * V(1:3,:)
$\tilde{I}$ = (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))
Here A is any invertible 3 x 3 matrix.

Noise
• With noise, $\tilde{I}$ has full rank.
• The best solution is to estimate a matrix that is as near to $\tilde{I}$ as possible, with the estimate having rank 3.
• Our current method does this: keeping only the three largest singular values gives the closest rank-3 matrix in the least-squares sense.

Weak Perspective Motion

$$ \tilde{I} = SP $$

$\tilde{I}$ = (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))
Rows 2k-1 and 2k of S (the u and v rows of frame k, for k = 1, …, m) should be orthogonal. All rows should be unit vectors.
(Push all scale into P.) Choose A so that (U(:,1:3) * A) satisfies these conditions.

Related problems we won't cover
• Missing data.
• Points with different, known noise.
• Multiple moving objects.

Final Messages
• Structure-from-motion for points can be reduced to linear algebra.
• Epipolar constraint reemerges.
• SVD important.
• Rank Theorem says the images a scene produces aren't complicated (also important for recognition).
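As a recap of the affine pipeline in these slides (subtract row means to remove translation, take the SVD, keep three singular values), here is a minimal NumPy sketch on synthetic, noise-free data. The function name `affine_sfm` and the random test data are assumptions for illustration only. It recovers motion and structure only up to the 3 x 3 ambiguity A, and it does not perform the weak-perspective step of choosing A. Note that `np.linalg.svd` returns the third factor already in the slides' orientation, so that the input equals U @ diag(d) @ Vt.

```python
import numpy as np

def affine_sfm(I):
    """Factor a 2m x n measurement matrix into translation, motion, and structure.

    Returns t (2m x 1 row means), S (2m x 3), P (3 x n), and the singular values d,
    so that S @ P + t is the best rank-3-plus-translation fit to I.
    """
    t = I.mean(axis=1, keepdims=True)          # row means = translation (centroid as origin)
    I_tilde = I - t                            # centered measurements, no translation left
    U, d, Vt = np.linalg.svd(I_tilde, full_matrices=False)
    S = U[:, :3]                               # motion, up to an invertible 3x3 matrix A
    P = np.diag(d[:3]) @ Vt[:3, :]             # structure, up to inv(A)
    return t, S, P, d

# Synthetic example: m frames, n points, random affine cameras (illustrative data).
rng = np.random.default_rng(0)
m, n = 5, 20
P_true = rng.normal(size=(3, n))
S_true = rng.normal(size=(2 * m, 3))
t_true = rng.normal(size=(2 * m, 1))
I = S_true @ P_true + t_true

t, S, P, d = affine_sfm(I)

print(np.allclose(S @ P + t, I))               # the factorization reproduces the images
print(d[3] / d[0])                             # fourth singular value is numerically zero

# The recovered motion differs from the true one by some 3x3 matrix A: S @ A = S_true.
A = np.linalg.lstsq(S, S_true, rcond=None)[0]
print(np.allclose(S @ A, S_true))
```

With noisy measurements the fourth and later singular values would no longer vanish, and truncating to three is what gives the closest rank-3 estimate.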