CS3432: Lecture 20   Stereo Vision (part 1)

Human vision has several remarkable properties. Amongst them is the ability to infer three-dimensional interpretations of scenes from 2D retinal images. We use several cues to do this, including shading cues to infer surface conformation, and changes in the texture patterns of surfaces to infer their orientations in 3D. Computer vision methods have been developed to simulate these visual capacities. By far the most successful, and the most straightforward, computer vision method for inferring three dimensions is stereopsis: two eyes (or cameras) taking slightly different views of a scene can provide accurate estimates of depth (distance from the viewer). Stereo vision is used in natural vision whenever precise location of objects in space is important. Computer vision systems employ stereo vision when it is necessary to make accurate 3D descriptions of scenes.

Here we consider passive stereo vision systems, in which the sensors use the ambient light reflected from the scene. For some purposes it is possible to use active sensors that project lines or other patterns of light onto the scene to assist in 3D interpretation. Such approaches can be extremely important in certain classes of problem, such as inspection of manufactured components. (More in Jain section 11.4.)

Stereo systems are useful in a variety of applications, such as:

Remote sensing. Satellite images of the Earth (or of other objects in the solar system) can be used in pairs to construct topographic maps of planetary surfaces. Unmanned exploratory vehicles on Mars or the Moon use pairs of cameras for stereo vision to make 3D maps of the terrain they find themselves in.

Robotics. Robots may need to know the spatial position of objects in their environment with high accuracy, and stereo vision can be used for this. Mobile autonomous vehicles need to be able to avoid obstacles and plan their paths through their environment; precise 3D location of obstacles and other objects of interest is important.

Inspection. 3D manufactured objects may need to be inspected for correct assembly. Visually, the object must be matched with a 3D model of its expected appearance, for which it is necessary to estimate the orientation of the object in space and the relative positions of its important components.

The presentation of stereo vision here is very close to that in Jain, so these notes deal in a summary fashion with matters that are discussed in Jain and concentrate on topics that are not.

Stereo basics

We need a geometric camera model, for which the pinhole camera is used. The geometry is illustrated in the figure below. Light from a point P in the scene (a vector with scene coordinates X, Y, Z) is projected through the centre of projection, O, to form an image point p (a vector with image coordinates x, y). The image plane is conventionally drawn in front of the centre of projection (which is also the scene origin in this diagram) to keep the image and scene coordinates compatible. The distance between O and the image plane is the focal length, f.

[Figure: pinhole camera geometry. The image plane lies at distance f in front of the centre of projection O; the scene point P = (X, Y, Z) projects to the image point p = (x, y).]

The scene and image coordinates are related by the perspective transform

    x = fX/Z   and   y = fY/Z.

(See Jain 1.4 for more details.) If we could determine Z we could work out the scene coordinates corresponding to each image point, and so have a complete description of the 3D scene.
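To make the perspective transform concrete, here is a minimal Python sketch of the pinhole projection described above (the function and variable names are illustrative, not taken from Jain):

    def project(X, Y, Z, f):
        """Project the scene point (X, Y, Z) through a pinhole camera with
        focal length f, returning the image coordinates (x, y)."""
        if Z <= 0:
            raise ValueError("the point must lie in front of the camera (Z > 0)")
        return f * X / Z, f * Y / Z

    # Example: a point 0.5 m to the right and 2 m away, with f = 0.05 m,
    # projects to x = 0.05 * 0.5 / 2.0 = 0.0125 and y = 0.0.
    print(project(0.5, 0.0, 2.0, 0.05))

Note that the inverse problem is underdetermined: a single image point (x, y) only constrains P to lie somewhere along the ray through O, which is exactly why a second view is needed to fix Z.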
Binocular stereo provides us with a way of calculating Z. The diagram below shows a simple stereo geometry.

Two cameras have parallel optic axes and are positioned so that their optical centres are on the same Y-coordinate and a distance b apart in X. We take the origin of the scene coordinates to be half way between the optical centres and refer to b as the baseline of the stereo system. Both cameras have focal length f. A scene point P projects into the image points pl and pr, with coordinates (xl, yl) and (xr, yr) in the left and right images respectively.

[Figure: parallel-axis stereo geometry. The optical centres Ol and Or lie a baseline b apart in X, with the scene origin midway between them; P projects to pl (at xl) in the left image plane and to pr (at xr) in the right image plane, both planes lying at distance f from their optical centres.]

By identifying similar triangles in the left and right images respectively we can deduce the following equations:

    xl/f = (X + b/2)/Z   and   xr/f = (X - b/2)/Z.

From these,

    xl - xr = bf/Z,   or   Z = bf/(xl - xr).

The quantity xl - xr is called the disparity. It is the shift in position of the projections of a scene point between the two different views. In this case, since there is no separation between the optical centres in Y, there is no y-component of the disparity.

If we know the baseline and the focal length, and can calculate the disparity, it is easy to calculate Z (and hence X and Y). The problem of measuring b and f is that of calibration (of which more later). The most difficult problem in stereo is that of establishing point correspondences. The pair of image points pl and pr are called conjugate points, and we need a method of establishing the correspondences.

The Correspondence Problem

The correspondence problem is the most difficult problem in stereo imaging. There is potentially a very large number of possible matches between the two images, and only by finding the correct conjugate pairs in this large space can we reconstruct the scene coordinates.

The geometry comes to our aid. Since there is no disparity in y (in our simple system), a match for any point in the left image must have the same y-coordinate in the right image, so if we are searching for a match we need only search along the same y-line. This is known as the epipolar constraint, and the y-lines are called epipolar lines. Each point in one image defines an epipolar line in the other image. We shall see that for more realistic stereo geometries the epipolar lines are not simply y-lines; even so, the constraint reduces the search for a match to a 1-D search.

Not all image points will be useful for matching. We need points that show some image structure.

Edge Matching

We have already seen that edges represent significant image structure, so there is sense in trying to match edge points in a pair of images. Since the number of edge points is much smaller than the total number of points, there will be fewer possible matches and fewer ambiguities to resolve. The edges also represent structurally important points in both images and are therefore useful scene coordinates to know.

For each edge point in the left image (say), there are likely to be several candidate matching points on the epipolar line in the right image. Edge parameters can be used to disambiguate these possible matches. If we use an edge detector such as Canny, each edge point will be characterised by edge magnitude, polarity and direction. We can reject potential matches in which the polarity is reversed, the edge directions are significantly different or the magnitudes are very dissimilar.
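As an illustration of the two calculations just described, here is a minimal Python sketch (the function names and thresholds are assumptions for illustration, not taken from Jain): one function recovers the scene coordinates of a conjugate pair using Z = bf/(xl - xr), and the other applies the polarity, direction and magnitude tests to decide whether two Canny edge points are plausible conjugates.

    import math

    def reconstruct(xl, yl, xr, b, f):
        """Recover (X, Y, Z) from a conjugate pair in the parallel-axis geometry.
        xl, yl are the left image coordinates; xr is the right image x-coordinate
        (yr = yl by the epipolar constraint); b is the baseline, f the focal length."""
        disparity = xl - xr
        if disparity <= 0:
            raise ValueError("disparity must be positive for a point in front of the cameras")
        Z = b * f / disparity
        X = xl * Z / f - b / 2.0      # invert xl = f(X + b/2)/Z
        Y = yl * Z / f
        return X, Y, Z

    def plausible_match(mag_l, dir_l, pol_l, mag_r, dir_r, pol_r,
                        mag_ratio=1.5, dir_tol=math.radians(30)):
        """Reject candidate edge matches whose polarity is reversed, whose
        directions differ by more than dir_tol, or whose magnitudes differ by
        more than a factor of mag_ratio (both thresholds are illustrative)."""
        if pol_l != pol_r:
            return False
        d = abs(dir_l - dir_r) % (2 * math.pi)
        if min(d, 2 * math.pi - d) > dir_tol:
            return False
        return max(mag_l, mag_r) <= mag_ratio * min(mag_l, mag_r)

Applied along an epipolar line, plausible_match thins the candidate list, and reconstruct turns each accepted conjugate pair into a scene point.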
Using the Canny edge detector provides other advantages. Edges can be located at different scales, so the search for matches can be conducted using a coarse-to-fine scale strategy, which usually results in a more robust and faster search. Large-scale edges can be detected first: there will be fewer of these, so fewer ambiguities to resolve using the polarity and direction parameters, and the positions of the coarse-scale matches can be used to narrow down the search area at finer scales. The non-maximal suppression element of the Canny operator can be used to locate edge positions with sub-pixel precision by fitting model edge profiles across the detected edge direction. Precise location of edges is important, since accurate estimation of depth depends critically on finding the most accurate positions of the conjugate points.

Correlation matching

Edge points are not unique, in the sense that neighbouring points along an edge would provide equally good matches in any pairing scheme. The correct match, of course, lies at the intersection of the edge with the epipolar line, but there are two problems. First, we do not know the epipolar lines with arbitrary accuracy. (As we shall see, the epipolar lines are not, in general, parallel to the x-axis.) In searching for solutions we should be prepared to look at points close to either side of the epipolar line, and in that case it may be very difficult to determine the best matching point along an edge accurately. The second difficulty is a corollary of this: it becomes harder to determine accurate disparity values for edges that are not perpendicular to the epipolar lines. At the extreme, it is impossible to generate accurate disparity values for points on horizontal edges.

Another approach is to find matches between isolated salient points. The correspondence can be achieved by template matching (e.g. cross-correlation). However, we need to select a sparse set of salient points to match, both to reduce the number of candidates and to improve the reliability of matching. It would be useless to try to generate matches in flat areas of the image, for example. We need first of all to locate interesting points. We might characterise these as points where there is significant structure, but which do not lie on simple edges. We can outline two approaches.

The Moravec Operator

(For detail, see Jain 11.2.2.) This is a non-linear filter using a support region (typically 5x5). Over the region, the squares of the first pixel differences between neighbouring points (axial and diagonal) are summed, giving four values for the region. The output of the filter is the minimum of these values. (This ensures that the operator does not respond merely to an edge: there needs to be a measurable image gradient in all directions.) Locating local maxima in this output function yields isolated points that stand a good chance of producing robust correlations.

Corner Detection

There are several variants of corner detector, which have similarities to the Moravec operator. The following is easy to understand. At each point calculate the following matrix over some support region:

    C = [ ∑(∂I/∂x)²           ∑(∂I/∂x)(∂I/∂y) ]
        [ ∑(∂I/∂x)(∂I/∂y)     ∑(∂I/∂y)²       ]

The partial derivatives are just local image gradients, which could be obtained by pixel differencing or by applying a gradient operator such as Canny. The matrix captures the image gradient in all directions. Since it is symmetric, it can be diagonalised by rotation to give

    C = [ λ1   0  ]
        [ 0    λ2 ]

where λ1 and λ2 are the eigenvalues, conventionally with λ1 ≥ λ2. Each eigenvalue corresponds to an eigenvector which represents a principal gradient direction. If both λ1 and λ2 have significantly large values, then there are significant edges in two image directions (determined by the eigenvectors, if we wanted to know them) and the image point is on a corner. We just need to set a high enough threshold on λ2 to determine the corner positions.
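A minimal Python/NumPy sketch of this eigenvalue test is given below. It uses simple pixel differencing for the gradients and a square support window; the function name, window size and the idea of returning λ2 as a per-pixel "corner strength" are illustrative choices, not a prescribed implementation.

    import numpy as np

    def corner_strength(image, half_window=2):
        """For each pixel, return the smaller eigenvalue (lambda2) of the matrix C
        accumulated over a (2*half_window + 1)^2 support region. Corners are the
        local maxima of this map that exceed a suitably high threshold."""
        I = image.astype(float)
        # Local gradients by simple pixel differencing.
        Ix = np.zeros_like(I)
        Iy = np.zeros_like(I)
        Ix[:, 1:-1] = (I[:, 2:] - I[:, :-2]) / 2.0
        Iy[1:-1, :] = (I[2:, :] - I[:-2, :]) / 2.0

        rows, cols = I.shape
        w = half_window
        lam2 = np.zeros_like(I)
        for r in range(w, rows - w):
            for c in range(w, cols - w):
                gx = Ix[r - w:r + w + 1, c - w:c + w + 1]
                gy = Iy[r - w:r + w + 1, c - w:c + w + 1]
                C = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                              [np.sum(gx * gy), np.sum(gy * gy)]])
                lam2[r, c] = np.linalg.eigvalsh(C)[0]   # eigenvalues in ascending order
        return lam2

Thresholding this map and keeping only local maxima gives the isolated corner points wanted for correlation matching. The Moravec operator fits the same framework: replace the eigenvalue computation with the minimum of the four directional sums of squared differences.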
Matching Constraints

With our best efforts to find salient and individual matching points, we are still likely to need to choose between several candidate matches on an epipolar line. Solving this is a constraint satisfaction problem: we need to apply some constraints to the solution and find an optimal match by search or relaxation methods. Several constraints can apply.

Uniqueness. Each point on the left image must match at most one point on the right image. This is sensible if we are dealing with opaque surfaces (but not, say, a goldfish in a bowl; let's not worry about that). We need to allow for non-matches, since points will be occluded differently in the two views.

Smoothness. Points which are close to one another in the image are likely to be close in depth. Again this is reasonable if we are dealing with scenes composed of opaque surfaces, but it is not true all of the time: occluding edges occur at positions of depth discontinuity.

Order. It is usually the case that the order of points along a line is maintained in the two images. It is always true for our simple parallel-gaze geometry, but not necessarily so in more general camera configurations.

As an example of correspondence matching using constraints we consider a widely known algorithm, the PMF algorithm (named after its developers Pollard, Mayhew and Frisby). This algorithm expresses smoothness as a limit on the disparity gradient. Consider two points a and b seen in the left and right images as shown below.

[Figure: points a and b project to al and bl in the left image and to ar and br in the right image; averaging the two views gives the cyclopean image, in which the two points are a distance sep apart. The disparities are da = al - ar and db = bl - br.]

We invent the cyclopean or average image, in which each point has the average of its positions in the left and right images. For each candidate pair of points we calculate its disparity (e.g. da = al - ar). The disparity gradient between two points is then simply the difference in their disparities divided by their separation in the cyclopean image, (da - db)/sep. Psychophysical evidence suggests that human stereopsis does not reconstruct 3D interpretations from sets of points whose disparity gradient is > 1.0. This simply states that large differences in disparity can be accepted provided the points are far enough apart.

The PMF algorithm uses this limit on the disparity gradient as a smoothness constraint. The algorithm proceeds by first finding edge primitives and selecting candidate matches along an epipolar line using edge polarity and direction criteria. It then calculates a matching score for each candidate match by examining the other potential matches in some local neighbourhood. This score is based on the number of other matches for which the disparity gradient with respect to the candidate is < 1.0. The best matches found in this way are put on a matched list and the algorithm proceeds by relaxation.
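The disparity-gradient calculation at the heart of this constraint is small enough to show directly. The sketch below is not the PMF authors' code; for brevity it treats each candidate match as a pair of x-positions on a single epipolar line (in the full algorithm sep is a 2-D distance in the cyclopean image), and the function names are assumed.

    def disparity_gradient(match_a, match_b):
        """Disparity gradient between two candidate matches, each given as a pair
        (x_left, x_right) on the same epipolar line: |da - db| / sep, where sep is
        the separation of the two points in the cyclopean (average) image."""
        al, ar = match_a
        bl, br = match_b
        da = al - ar                    # disparity of match a
        db = bl - br                    # disparity of match b
        sep = abs((al + ar) / 2.0 - (bl + br) / 2.0)   # cyclopean separation
        return abs(da - db) / sep if sep > 0 else float("inf")

    def support_score(candidate, neighbours, limit=1.0):
        """PMF-style matching score: the number of neighbouring candidate matches
        whose disparity gradient with respect to this candidate is below the limit
        (1.0, following the psychophysical evidence)."""
        return sum(1 for n in neighbours if disparity_gradient(candidate, n) < limit)

Candidates with the highest support scores are the ones placed on the matched list first, and the remaining ambiguities are then resolved by relaxation, as described above.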
Reading for lecture 20

Jain. Section 1.4 describes the pinhole camera model and the perspective projection. Chapter 11 deals with the simple stereo geometry and disparity, correlation matching and the Moravec interest operator. Section 11.4 provides some discussion of other "shape from X" topics (X = texture, shading, etc.) and of range imaging (structured light); you should be able to understand the principles of these methods at the very superficial level at which they are presented here. Section 11.5 goes into the topic of active vision, which is interesting but beyond the scope of this course.

Sonka. Chapter 9 goes into 3D imaging in rather more mathematical detail than is required for our purposes. Imaging geometry is dealt with from the outset as a 3D vector problem, and if you are happy with matrix algebra the arguments of this lecture and the next are presented rather concisely in that form. Section 9.1 broadens the scope of 3D vision to incorporate Marr's proposals for representations in natural vision and the topic of active vision; neither of these is necessary, but both are interesting. Sonka can be taken as an interesting, and mathematically complete, extension of the presentation in Jain and in the lectures, but is not necessary reading for our purposes.