CS3432: Lecture 21 Stereo Vision (part 2)

In the notes to lecture 20 we were introduced to the pinhole camera model and a simple (rather improbable) stereo geometry. Here we generalise the geometry, and see that stereo reconstruction is still straightforward, but that we need to calibrate the values of a rather larger number of parameters.

Stereo geometry revisited

The general stereo geometry is shown in the figure below. The left and right image planes are no longer parallel and aligned. Scene point P, a vector with scene coordinates (X, Y, Z), projects into image points pl and pr with coordinates (xl, yl) and (xr, yr) in the coordinate systems of the left and right images respectively. The optical centres Ol and Or are separated from the image planes by the focal lengths fl and fr. Thus the image points pl and pr can be represented by the three-dimensional coordinates (xl, yl, fl) and (xr, yr, fr). The vector to the scene point in each of the camera coordinate systems is Pl and Pr respectively. Any point along the vector Pl, say, is just a scalar multiple of pl. That is, Pl = al pl, where al is a scalar. Similarly, Pr = ar pr.

[Figure: the general stereo geometry. Scene point P(X,Y,Z); left and right image planes with camera coordinate axes (xl, yl, zl) and (xr, yr, zr); optical centres Ol and Or at focal lengths fl and fr from the image planes; projections pl and pr; translations Tl, Tr and rotations Rl, Rr relating the scene axes X, Y, Z to each camera.]

Coordinate transformations

In this more generalised geometry, we need to be much clearer about the coordinate systems being used than we were in the simple case. The figure shows three coordinate systems: the coordinates of each of the left and right cameras (xl, yl, zl and xr, yr, zr), and the scene (or world) coordinates (X, Y, Z). As we can see, the origins of these coordinate systems do not coincide, and their axes are not parallel. We need the scene coordinate system because ultimately we want to know the positions in space of points in the scene independently of where we chose to put the cameras. To do this we need to know the relationships between vectors in the scene coordinate system and vectors in each of the camera coordinate systems. This relationship will be expressed as a translation vector between the scene origin and the camera origin (Tl and Tr), and a 3D rotation matrix (Rl and Rr). These transformations can be determined by calibration (below).

Stereo reconstruction

Reconstruction is simply a matter of finding the point of intersection of the vectors Pl and Pr. These, however, are represented in the left and right camera coordinates. We need the vectors expressed in scene coordinates. We can write these as:

    Pl′ = Tl + Rl Pl    and    Pr′ = Tr + Rr Pr.

The reconstructed point is the intersection of these two vectors, i.e. the point where Pl′ = Pr′. Since we know that Pl = al pl and Pr = ar pr, we can find the point of intersection of the vectors from the following set of linear equations:

    Tl + al Rl pl = Tr + ar Rr pr.

Recall that pl and pr are the vectors (xl, yl, fl) and (xr, yr, fr). We can measure xl, yl, xr and yr. If we know the coordinate transformations (Tl, Tr, Rl and Rr) and the focal lengths (fl and fr), we can solve this set of linear equations for al and ar. The next section indicates how we can determine these parameters.

The solution is slightly more complicated in practice. Because of measurement errors and calibration errors, the projected vectors Pl′ and Pr′ do not, in general, intersect in space. We need to find the mid-point of the shortest line segment joining them. This can also be achieved by solving a system of linear equations.
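The reconstruction step can be written down directly in code. The following is a minimal numpy sketch, not part of the original notes; the function and variable names are illustrative. It solves Tl + al Rl pl = Tr + ar Rr pr in the least-squares sense, which is exactly the mid-point construction described above: the least-squares values of al and ar give the closest points on the two rays, and the reconstruction is the mid-point between them.

    import numpy as np

    def triangulate_midpoint(p_l, p_r, T_l, R_l, T_r, R_r):
        """Reconstruct a scene point from a pair of matched image points.

        p_l, p_r : image points as 3-vectors (x, y, f) in left/right camera coords.
        T_l, R_l, T_r, R_r : extrinsics taking each camera's coords to scene coords.
        Returns the mid-point of the shortest segment joining the two rays.
        """
        d_l = R_l @ p_l          # direction of the left ray in scene coordinates
        d_r = R_r @ p_r          # direction of the right ray in scene coordinates

        # a_l * d_l - a_r * d_r = T_r - T_l  (three equations, two unknowns)
        A = np.column_stack((d_l, -d_r))
        b = T_r - T_l
        coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
        a_l, a_r = coeffs

        P_left = T_l + a_l * d_l    # closest point on the left ray
        P_right = T_r + a_r * d_r   # closest point on the right ray
        return 0.5 * (P_left + P_right)

If the two rays happened to intersect exactly, the residual of the least-squares solution would be zero and the mid-point would be the intersection itself.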
Camera Calibration

The above equations relate the scene coordinates of points in space to the image coordinates of the projections of those points in the left and right images. To solve them we need to know the camera parameters. We can determine the camera parameters using the same relationships if we know accurate scene coordinates corresponding to image points. We can classify camera parameters into two types, extrinsic and intrinsic.

Extrinsic parameters

The extrinsic parameters relate the camera coordinate systems to the scene coordinates. They are the translation vector T (Tx, Ty, Tz) and the rotation matrix R for each camera. R is a 3×3 matrix, but there are only three free parameters (the angles through which the coordinate system rotates).

Intrinsic parameters

The intrinsic parameters are internal properties of the camera. One of these is the focal length f, which we need in order to solve the reconstruction equations. There are others needed to relate the image coordinates to the pixel coordinates measured by the detector. The image plane is a geometrical construct determined by the optics; the sensor is a physical object located on the image plane (we hope). At the very least we need a scale factor to turn the dimensionless pixel values into a meaningful unit of measurement, such as millimetres (the millimetre spacing of the elements of the CCD sensor). In practice we may need two scale factors, sx and sy, since the pixel spacing in the two directions may be different. The origin of the image plane and the origin of the sensor would ideally coincide, but in practice there may be a shift between them, requiring us to know two other parameters, dx and dy, the offset of the sensor origin relative to the origin of the image plane. These parameters are illustrated in the figure. The red rectangle represents the image plane; the black pixel grid is superimposed on it.

[Figure: the intrinsic parameters. The pixel grid lies on the image plane; sx and sy are the pixel dimensions, and dx and dy the shift between the grid origin and the image origin.]

The parameters f, sx and sy are interdependent: we only need to know f and the ratio sx/sy. For complete accuracy we would need to consider further parameters that estimate the magnitude of distortions introduced by the lenses. Here we are simply looking at the nature of the problem, so we won't create further complexity by considering these. (A short sketch at the end of this section shows how the intrinsic parameters take pixel coordinates to image-plane coordinates.)

Calibration

There are a number of algorithms for determining these parameters. All of them need a calibration target – some carefully manufactured object containing features that can be easily and accurately located in images, and whose coordinates have been measured accurately relative to its own (scene) origin. The figure shows left and right images of one form of calibration target. This consists of a set of black squares on a light background. The positions of the corners of the squares have been measured accurately. These points can be located easily, and to sub-pixel precision, using (say) a Canny edge detector. Provided the planar target is oriented at a suitable angle to both cameras, the relationships between the 3D scene coordinates and the image coordinates in the two views can be used to calculate the extrinsic and intrinsic parameters.

Different calibration algorithms are compromises between the accuracy of calibration required, the complexity of the algorithm and the constraints on the target. The planar target shown in the figure is easy to manufacture and measure. The algorithm that makes use of it is fairly straightforward, but there is a limit on the accuracy with which the parameters can be determined. It turns out that this is accurate enough for most measurement purposes. If very high accuracy is required, then more complex algorithms are used, involving multi-parameter non-linear optimisation. These are less robust in terms of their tolerance to variation in target position and the accuracy with which the target points are known. They also require 3D targets, which are more difficult to engineer and measure.
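As promised above, here is a minimal sketch of how the intrinsic parameters are used once they have been calibrated. It is not part of the original notes; the names and the sign convention for dx, dy are illustrative (here dx, dy are taken to be the position of the sensor origin in image-plane coordinates), and lens distortion is ignored as in the notes.

    import numpy as np

    def pixel_to_image_plane(u, v, f, sx, sy, dx, dy):
        """Convert detector pixel coordinates (u, v) into the 3-vector (x, y, f)
        used in the reconstruction equations.

        sx, sy : pixel spacing in millimetres per pixel (horizontal, vertical)
        dx, dy : position of the sensor origin in image-plane coordinates (mm)
        f      : focal length (mm)
        """
        x = u * sx + dx   # pixel column -> millimetres, shifted to the image-plane origin
        y = v * sy + dy   # pixel row    -> millimetres, shifted to the image-plane origin
        return np.array([x, y, f])

The resulting vector is exactly the pl or pr needed by the reconstruction sketch earlier, so a measured pixel position can be passed through this conversion and then into triangulate_midpoint.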
Epipolar Geometry

In our simple stereo geometry we noted that corresponding points lie on the same y-line in the left and right images. We called these epipolar lines. In our more general camera geometry, the epipolar lines are no longer parallel to the image x-axes. The figure below shows the camera geometry again. This time the plane connecting the scene point P to the optic centres of the two cameras, Ol and Or, has been shaded. This is the epipolar plane. Clearly the two projections of P (pl and pr) both lie on this plane, on the lines where it intersects the image planes. These lines are the epipolar lines in this geometry. The line connecting the optic centres, Ol and Or, intersects the left and right image planes at the epipoles (el and er). This intersection may not be within the field of view (the image planes are geometrical constructs, infinite in extent); the image planes are shown extended in the figure. The left epipole is the projection of the optic centre of the right camera in the left camera's image plane, and vice versa. Notice that the line connecting Ol and Or is always part of the epipolar plane, for every scene point. That is, all epipolar lines in a given image pass through the epipole.

Search for correspondences is still one-dimensional, along the epipolar lines. For edge search it is not too much of a problem that the epipolar lines are tilted with respect to the image axes. For correlation-based correspondence it may be useful to rectify the images. Rectification performs a warping of one or both of the images that has the effect of rotating the image planes to be parallel to each other, putting the epipoles at infinity and generating epipolar lines parallel to the image x-axis. This makes search rather more convenient. (A short sketch after the inspection example below shows how an epipolar line can be computed from the calibrated geometry.)

[Figure: the epipolar geometry. Scene point P(X,Y,Z); the shaded epipolar plane through P, Ol and Or; the epipolar lines where it cuts the two image planes; and the epipoles el and er.]

A Stereo Inspection Example

The figures below (a, b) show stereo images of a 3D part (a component of a motor-car steering column). If we wish to know the three-dimensional orientation (pose) of this object, we could match a 3D model of its expected appearance to the 3D positions of features in the scene. The model in this case consists of a set of circles and lines in 3D corresponding to the important features of the object. The second pair of images (c, d) shows edges detected in the stereo images with some epipolar lines superimposed. The epipolar lines are close to parallel to the x-axis because the two cameras are situated close together, looking at the object some distance away. The next pair of images shows the 3D edge positions determined by stereo reconstruction (e) and some features (lines and circles) extracted from these edges (f) to reduce the amount of uninformative data in the 3D scene. The last pictures (g, h) show the projection of the 3D model onto the original 2D images after the pose has been refined.

[Figure panels a–h: the stereo image pair, detected edges with epipolar lines, reconstructed 3D edges and extracted features, and the refined model projected back onto the images.]
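Returning to the epipolar geometry above: given the calibrated transformations, the epipolar line in the right image for a point in the left image can be found by projecting two points of the left ray into the right camera; one convenient choice of point is the left optical centre itself, whose projection is the right epipole. The sketch below is not part of the original notes, the names are illustrative, and it assumes the projected points lie in front of the right camera (positive z).

    import numpy as np

    def project_to_right(P_scene, T_r, R_r, f_r):
        """Project a scene point into the right image plane, returning (x_r, y_r)."""
        P_cam = R_r.T @ (P_scene - T_r)     # scene coords -> right camera coords
        X, Y, Z = P_cam                     # assumes Z > 0 (point in front of camera)
        return np.array([f_r * X / Z, f_r * Y / Z])

    def epipolar_line_in_right(p_l, T_l, R_l, T_r, R_r, f_r):
        """Return two image points defining the epipolar line in the right image
        for the left-image point p_l = (x_l, y_l, f_l).

        The left ray is T_l + a * (R_l @ p_l) in scene coordinates; projecting
        any two of its points into the right camera gives the epipolar line.
        """
        e_r = project_to_right(T_l, T_r, R_r, f_r)               # a = 0: the right epipole
        q_r = project_to_right(T_l + R_l @ p_l, T_r, R_r, f_r)   # a = 1: a second point
        return e_r, q_r

A correlation or edge search for the match to the left-image point can then be restricted to pixels close to the line through the two returned points.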
Trinocular Stereo

There is no need to be limited to two cameras. Indeed, the correspondence problem can be made more tractable by using three views (trinocular stereo). In this case there are three sets of epipolar lines. A point in image 1 corresponds to epipolar lines in images 2 and 3; a candidate match on the line in image 2 in turn defines an epipolar line in image 3, which intersects the first line in a single point. Correspondence matching could therefore, in principle, be determined entirely from the epipolar geometry. In practice, where trinocular stereo is used, the third epipolar line is used to verify matches made by correspondence. A disadvantage of trinocular stereo is that we have three cameras to calibrate.

Uncalibrated stereo

If our final aim is not to reconstruct the 3D scene with metric accuracy, i.e. to determine absolute values of scene coordinates, we can use uncalibrated stereo cameras. This means calculating the calibration parameters up to a scale factor from the positions of scene points. By calculating up to a scale factor we mean determining the relative positions of points but not the absolute dimensions. This is a familiar enough idea: our own visual system, deprived of sufficient cues to estimate absolute depth, may not be able to tell the difference between a large object far away and a small object nearby, but we would still be able to determine the 3D shape. How this is done takes us outside the scope of this course. In essence it involves making the scene coordinate system coincide with one of the camera coordinate systems (we might as well, since we are not interested in absolute positions). This halves the number of extrinsic parameters to be determined. It can be shown that if we can reliably match at least eight points in each of the images, we can recover enough parameters directly from the scene itself to calculate relative 3D positions in the scene. The eight points need to avoid unlucky "degenerate" configurations, such as being coplanar. For many purposes this may be sufficient, particularly if other cues are available to estimate the scale factor. (A small sketch of the eight-point idea is given after the reading list below.)

Reading for lecture 21

Jain Chapter 12 deals with calibration and the details of the rotation matrices in rather more detail than we require. Section 12.6 gives an algorithm for depth reconstruction. Sonka Chapter 9 goes into 3D imaging in rather more mathematical detail than is required for our purposes. Imaging geometry is dealt with from the outset as a 3D vector problem; if you are happy with matrix algebra, the arguments of this lecture and the next are presented rather concisely in that form. Section 9.1 broadens the scope of 3D vision to incorporate Marr's proposals for representations in natural vision and the topic of active vision. Neither of these is necessary, but both are interesting. Sonka can be taken as an interesting, and mathematically complete, extension of the presentation in Jain and in the lectures, but is not necessary reading for our purposes.
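Finally, as promised under Uncalibrated stereo, here is a minimal sketch of the eight-point idea. It is not part of the original notes: what it estimates is usually called the fundamental matrix (not named above), a 3×3 matrix F that encodes the epipolar geometry of the pair and satisfies pr' F pl = 0 for every correspondence, with p = (x, y, 1). The names are illustrative, and the coordinate normalisation used in practice for numerical stability is omitted to keep the sketch short.

    import numpy as np

    def eight_point(pts_left, pts_right):
        """Estimate the 3x3 matrix F of the epipolar geometry from at least
        eight matched image points (arrays of shape (N, 2), N >= 8).
        Each match contributes one linear equation in the nine entries of F.
        """
        pts_left = np.asarray(pts_left, dtype=float)
        pts_right = np.asarray(pts_right, dtype=float)
        xl, yl = pts_left[:, 0], pts_left[:, 1]
        xr, yr = pts_right[:, 0], pts_right[:, 1]

        # Rows of the design matrix: coefficients of the nine entries of F.
        A = np.column_stack([xr * xl, xr * yl, xr,
                             yr * xl, yr * yl, yr,
                             xl, yl, np.ones_like(xl)])

        # The solution is the right singular vector with the smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        F = Vt[-1].reshape(3, 3)

        # Enforce rank 2, so that all epipolar lines pass through the epipole.
        U, S, Vt = np.linalg.svd(F)
        S[2] = 0.0
        return U @ np.diag(S) @ Vt

From this matrix the epipoles and the epipolar line of any image point can be read off, which is what allows the relative (scale-free) reconstruction described in the uncalibrated-stereo discussion.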