CS3432: Lecture 20 Stereo Vision (part 1)

Human vision has several remarkable properties. Amongst them is the ability to infer three dimensional
interpretations of scenes from 2D retinal images. We use several cues to do this, including cues from the shading
of objects to infer surface conformation. We can also use changes in the texture patterns of surfaces to infer their
orientations in 3D. Computer vision methods have been developed to simulate these visual capacities. By far the
most successful, and the most straightforward, computer vision method for inferring three dimensions is stereopsis
– the fact that two eyes (or cameras) taking slightly different views of a scene can provide accurate estimates of
depth (distance from the viewer). Stereo vision is used in natural vision whenever precise location of objects in
space is important. Computer vision systems employ stereo vision when it is necessary to make accurate 3D
descriptions of scenes. Here we consider passive stereo vision systems, in which the sensors use the ambient
reflected light from the scene. For some purposes it is possible to use active sensors that project lines or other
patterns of light onto the scene to assist in 3D interpretation. Such approaches can be extremely important in
certain classes of problem, such as inspection of manufactured components. (More in Jain section 11.4.)
Stereo systems are useful in a variety of applications, such as:
Remote sensing. Satellite images of earth (or other objects in the solar system) can be used in pairs to construct
topographic maps of planetary surfaces. Unmanned exploratory vehicles on Mars or the Moon use pairs of
cameras for stereo vision to make 3D maps of the terrain they find themselves in.
Robotics. Robots may need to understand the spatial position of objects in their environment with high accuracy,
and stereo vision can be used for this. Mobile autonomous vehicles need to be able to avoid obstacles and plan
their paths through their environment. Precise 3D location of obstacles and other objects of interest is important.
Inspection. Manufactured 3D objects may need to be inspected for correct assembly. Visually the object must
be matched with a 3D model of expected appearance, for which it is necessary to estimate the orientation of the
object in space and the relative positions of its important components.
The presentation of stereo vision here is very close to that in Jain, so these notes will deal in a summary fashion
with matters that are discussed in Jain and concentrate on topics that are not.
Stereo basics.
We need a geometric camera model, for which the pinhole camera is used. The geometry is illustrated in the
figure below. Light from a point P in the scene (a vector with scene coordinates X,Y,Z) is projected through the
centre of projection, O, to form an image point p (a vector with image coordinates x, y). The image plane is
conventionally shown in front of the centre of projection (also the scene origin in this diagram) to keep the image
and scene coordinates compatible. The distance between O and the image plane is the focal length, f.
[Figure: pinhole camera geometry. Light from a scene point P = (X, Y, Z) passes through the centre of projection O and forms the image point p = (x, y) on the image plane, which lies a focal length f in front of O.]
The scene and image coordinates are related by the perspective transform
    x = f X / Z   and   y = f Y / Z.   (See Jain 1.4 for more details.)
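To make the projection concrete, here is a minimal sketch (the focal length and point coordinates are invented example values, not taken from the notes) that applies the perspective transform to a scene point:

    # Pinhole projection: x = f*X/Z, y = f*Y/Z (all quantities in metres here).
    def project(X, Y, Z, f):
        """Project a scene point (X, Y, Z) to image coordinates (x, y)."""
        if Z <= 0:
            raise ValueError("point must lie in front of the camera (Z > 0)")
        return f * X / Z, f * Y / Z

    x, y = project(X=0.5, Y=0.2, Z=4.0, f=0.035)   # e.g. a 35 mm lens, point 4 m away
    print(x, y)                                     # (0.004375, 0.00175)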
If we could determine Z we could work out the scene coordinates corresponding to each image point, and have a
complete description of the 3D scene. Binocular stereo provides us with a way of calculating Z. The diagram
below shows a simple stereo geometry. Two cameras have parallel optic axes and are positioned so that their
optical centres are at the same Y-coordinate and are a distance b apart in X. We take the origin of the scene
coordinates to be half way between the optical centres and refer to b as the baseline of the stereo system. Both
cameras have focal length f. A scene point P projects into image points pl and pr with coordinates (xl, yl) and
(xr, yr) in the left and right images respectively.
[Figure: simple stereo geometry. Left and right image planes with optical centres Ol and Or separated by the baseline b, the scene origin midway between them, focal length f, and a scene point P at depth Z projecting to pl (at xl) and pr (at xr).]
By identifying similar triangles in the left and right images respectively we can deduce the following equations.
    xl / f = (X + b/2) / Z   and   xr / f = (X − b/2) / Z

From these, xl − xr = bf / Z, or equivalently Z = bf / (xl − xr).

The quantity xl − xr is called the disparity. It is the shift in position of the projection of a scene point between the
two views. In this case, since there is no separation between the optical centres in Y, there is no y-component of
the disparity. If we know the baseline and the focal length, and can calculate the disparity, it is easy
to calculate Z (and hence X and Y). The problem of measuring b and f is that of calibration (of which more later).
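As a worked illustration (the numbers are invented), the following sketch recovers scene coordinates from a pair of conjugate points using Z = bf / (xl − xr) and the perspective equations above:

    def reconstruct(xl, yl, xr, b, f):
        """Recover (X, Y, Z) from conjugate points in the simple parallel-camera
        geometry; all lengths are in the same units (metres in this example)."""
        d = xl - xr                    # disparity
        Z = b * f / d                  # depth from disparity
        X = xl * Z / f - b / 2.0       # shift back to the scene origin midway between the cameras
        Y = yl * Z / f
        return X, Y, Z

    print(reconstruct(xl=0.0105, yl=0.002, xr=0.0070, b=0.20, f=0.035))   # (0.5, ~0.114, 2.0)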
The most difficult problem in stereo is that of establishing point correspondences. The pair of image points pl and
pr are called conjugate points. We need to find a method of establishing the correspondences.
The Correspondence Problem
The correspondence problem is the most difficult problem in stereo imaging. There is potentially a very large
number of possible matches between the two images. Only by finding the correct conjugate pairs in this large
space can we reconstruct the scene coordinates. The geometry comes to our aid. Since there is no disparity in y
(in our simple system), then a match for any point on the left image must have the same y-coordinate in the right
image. If we are searching for a match we need only search along the same y-line. This is known as the epipolar
constraint, and the y-lines are called epipolar lines. Each point in one image defines an epipolar line in the other
image. We shall see that for more realistic stereo geometries the epipolar lines are not simply y-lines; even so, the
epipolar constraint reduces the search for a match to a 1-D search.
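As an illustration of how the constraint narrows the search, the sketch below (an invented example, not part of the notes) scans a single row of a rectified image pair with a sum-of-squared-differences block match; the window size and disparity range are arbitrary choices:

    import numpy as np

    def match_along_row(left, right, y, x, half_win=3, max_disp=32):
        """For pixel (x, y) in the left image, search the same row of the right image
        (the epipolar line in this simple geometry) and return the disparity d = xl - xr
        that minimises the sum of squared differences between windows."""
        patch_l = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1].astype(float)
        best_d, best_cost = 0, np.inf
        for d in range(max_disp + 1):
            xr = x - d                                  # the match shifts left in the right image
            if xr - half_win < 0:
                break
            patch_r = right[y - half_win:y + half_win + 1, xr - half_win:xr + half_win + 1].astype(float)
            cost = np.sum((patch_l - patch_r) ** 2)
            if cost < best_cost:
                best_d, best_cost = d, cost
        return best_d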
Not all image points will be useful for matching. We need points that show some image structure.
Edge Matching
We have already seen that edges represent significant image structure. There is sense in trying to match edge
points in a pair of images. Since the number of edge points is much less than the total number of points there will
be fewer possible matches and fewer ambiguities to resolve. The edges represent structurally important points in
both images and are therefore useful scene coordinates to know.
For each edge point on the left image (say), there are likely to be several candidate matching points on the
epipolar line in the right image. Edge parameters can be used to disambiguate these possible matches. If we use
an edge detector such as Canny, each edge point will be characterised by edge magnitude, polarity and direction.
We can reject potential matches in which the polarity is reversed, the edge directions are significantly different,
or the magnitudes are very dissimilar. Using the Canny edge detector provides other advantages. Edges can be
located at different scales, so the search for matches can be conducted using a coarse-to-fine scale strategy. This
strategy usually results in more robust and faster search. Large-scale edges can be detected first. There will be
fewer of these, so fewer ambiguities to be resolved using polarity or direction parameters. The positions of the
coarse-scale matches can be used to narrow down the search area at finer scales. The non-maximal suppression
element of the Canny operator can be used to locate edge positions with sub-pixel precision by fitting model edge
profiles across the detected edge direction. Precise location of edges is important, since the accurate estimation of
depth depends critically on finding the most accurate positions of the conjugate points.
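A minimal sketch of the compatibility test described above is given below; the edge attributes and thresholds are invented for illustration and would need tuning in practice:

    import math

    def compatible(edge_l, edge_r, max_angle=0.5, max_mag_ratio=1.5):
        """Decide whether two candidate edge points could be conjugate. Each edge is a
        dict with 'magnitude', 'polarity' (+1 or -1) and 'direction' (radians)."""
        if edge_l['polarity'] != edge_r['polarity']:
            return False                               # reversed polarity: reject
        d_theta = abs(edge_l['direction'] - edge_r['direction']) % math.pi
        d_theta = min(d_theta, math.pi - d_theta)      # wrap the direction difference
        if d_theta > max_angle:
            return False                               # directions too different
        lo = max(min(edge_l['magnitude'], edge_r['magnitude']), 1e-9)
        hi = max(edge_l['magnitude'], edge_r['magnitude'])
        return hi / lo <= max_mag_ratio                # magnitudes must be similar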
Correlation matching
Edge points are not unique in the sense that neighbouring points along an edge would provide equally good
matches in any pairing scheme. The correct match, of course, lies at the intersection of the edge with the epipolar
line, but there are two problems. First, we do not know the epipolar lines with arbitrary accuracy. (As we shall
see, the epipolar lines are not, in general, parallel to the x-axis.) In searching for solutions we should be prepared
to look at points close on either side of the epipolar line. In that case it may be very difficult to accurately
determine the best matching point along an edge. The second difficulty is a corollary of this: it becomes more
difficult to determine accurate disparity values for edges that are not perpendicular to the epipolar lines. At the
extreme, it is impossible to generate accurate disparity values for points on horizontal edges.
Another approach is to find matches between isolated salient points. The correspondence can be achieved by
template matching (e.g. cross-correlation). However we need to select a sparse set of salient points to match, both
to reduce the numbers of candidates and to improve the reliability of matching. It would be useless to try to
generate matches in flat areas of image, for example. We need first of all to locate interesting points. We might
characterise these as points where there is significant structure, but which do not lie on simple edges. We can
outline two approaches.
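Before turning to the interest operators, here is a minimal sketch of the correlation score itself, normalised cross-correlation between two patches (an illustrative example rather than anything prescribed in the notes):

    import numpy as np

    def ncc(patch_a, patch_b):
        """Normalised cross-correlation between two equal-sized patches, in [-1, 1];
        higher means a better match."""
        a = patch_a.astype(float) - patch_a.mean()
        b = patch_b.astype(float) - patch_b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        if denom == 0:
            return 0.0                 # a flat patch carries no matchable structure
        return float((a * b).sum() / denom)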
The Moravec Operator.
(For detail, see Jain, 11.2.2.) This is a non-linear filter using a support region (typically 5×5). Over the region the
squares of the first pixel differences between neighbouring points (axial and diagonal) are summed, giving four
values for the region. The output of the filter is the minimum of these values. (This ensures that the operator
doesn’t just respond to an edge. There needs to be a measurable image gradient in all directions.) Locating local
maxima in this output function results in isolated points that stand a good chance of resulting in robust
correlations.
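The following sketch illustrates the idea (it assumes the pixel lies away from the image border; the 5×5 window is the typical size mentioned above):

    import numpy as np

    def moravec(image, y, x, half=2):
        """Moravec interest measure at (x, y): the minimum, over the four principal
        shift directions, of the summed squared differences between the support
        window and the window shifted by one pixel."""
        img = image.astype(float)
        win = img[y - half:y + half + 1, x - half:x + half + 1]
        sums = []
        for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:       # axial and diagonal shifts
            shifted = img[y - half + dy:y + half + 1 + dy,
                          x - half + dx:x + half + 1 + dx]
            sums.append(np.sum((win - shifted) ** 2))
        return min(sums)        # small in flat regions and along straight edges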
Corner Detection
There are several variants of corner detector, which have similarities to the Moravec operator. The following is
easy to understand. At each point calculate the following matrix over some support region:

    C = | Σ (∂I/∂x)²          Σ (∂I/∂x)(∂I/∂y) |
        | Σ (∂I/∂x)(∂I/∂y)    Σ (∂I/∂y)²       |
The partial derivatives are just local image gradients, which could be obtained by pixel differencing or applying a
gradient operator like Canny. The matrix captures the image gradient in all directions. Since it is symmetric it
can be diagonalised by rotation to give:

    C = | λ1   0  |
        | 0    λ2 |
where λ1 and λ2 are eigenvalues. Conventionally λ1≥λ2. Each eigenvalue corresponds to an eigenvector which
represents a principal gradient direction. If both λ1 and λ2 have significantly large values, then there are significant
edges in two image directions (determined by the eigenvectors if we wanted to know), and the image point is on a
corner. We just need to set a high enough threshold on λ2 to determine the corner positions.
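A sketch of this detector is given below (illustrative only; the gradient estimate, window size and threshold are arbitrary choices). It forms the matrix C over a window at each pixel and keeps points where the smaller eigenvalue λ2 exceeds a threshold:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def corner_points(image, win=5, threshold=1000.0):
        """Return (row, col) positions where the smaller eigenvalue of the gradient
        matrix C, accumulated over a win x win region, exceeds the threshold."""
        img = image.astype(float)
        Iy, Ix = np.gradient(img)                       # local image gradients
        # entries of C, averaged over the support region (averaging only rescales the sums)
        Sxx = uniform_filter(Ix * Ix, win)
        Syy = uniform_filter(Iy * Iy, win)
        Sxy = uniform_filter(Ix * Iy, win)
        # eigenvalues of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]]
        half_trace = (Sxx + Syy) / 2
        det = Sxx * Syy - Sxy * Sxy
        lam2 = half_trace - np.sqrt(np.maximum(half_trace ** 2 - det, 0))
        return np.argwhere(lam2 > threshold)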
Matching Constraints
With our best efforts to find salient and individual matching points we are still likely to need to choose between
several candidate matches on an epipolar line. Solving this problem is a constraint satisfaction problem. We need
to apply some constraints on the solution and find an optimal match by search or relaxation methods. There are
several constraints that can apply.
Uniqueness: Each point on the left image must match to at most one point on the right image. This is sensible if
we are dealing with opaque surfaces (but not, say, a goldfish in a bowl; let's not worry about that). We need to
allow for non-matches since points will be occluded differently in the two views.
Smoothness: Points which are close to one another on the image are likely to be close in depth. Again this is
reasonable if we are dealing with scenes composed of opaque surfaces. It is not true all of the time. Occluding
edges occur at positions of depth discontinuity.
Order: It is usually the case that the order of points along an epipolar line is maintained in the two images. This
generally holds for our simple parallel-gaze geometry, but it can be violated (for example by a thin object close to
the cameras occluding a more distant surface), and it is not guaranteed in more general camera configurations.
As an example of correspondence matching using constraints we consider a widely-known algorithm, the PMF
algorithm (named after its developers Pollard, Mayhew and Frisby). This algorithm expresses smoothness as a
limit on the disparity gradient. Consider two points a and b on left and right images as shown below.
[Figure: points a and b in the left, cyclopean (average) and right images. The disparities are da = al − ar and db = bl − br, and sep is the separation of the two points in the cyclopean image.]
We invent the cyclopean or average image in which each point has the average of its positions in the left and right
image. For each candidate pair of points we calculate its disparity (e.g. da). The disparity gradient between two
points in the cyclopean image is simply the difference in their disparities divided by their separation:

    (da − db) / sep

Psychophysical evidence suggests that human stereopsis does not reconstruct 3D interpretations using sets of
points if the disparity gradient is > 1.0. This simply states that large differences in
disparity can be accepted provided the points are far enough apart. The PMF algorithm uses this limit on disparity
gradient as a smoothness constraint.
The algorithm proceeds by first finding edge primitives and selecting candidate matches along an epipolar line
using edge polarity and direction criteria. It then calculates a matching score for each candidate match by
examining the other potential matches in some local neighbourhood. This score is based on the number of other
matches for which the disparity gradient with respect to the candidate is < 1.0. The best matches found in this
way are put on a matched list and the algorithm proceeds by relaxation.
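A simplified sketch of the disparity-gradient support score is given below; it illustrates the idea rather than reproducing the PMF algorithm, and the match representation is invented for the example:

    def disparity_gradient(match_a, match_b):
        """Disparity gradient between two candidate matches, each given as
        (x_left, x_right, y): the difference in disparities divided by the
        separation of the two points in the cyclopean (average) image."""
        xa_l, xa_r, ya = match_a
        xb_l, xb_r, yb = match_b
        da, db = xa_l - xa_r, xb_l - xb_r                    # disparities
        xa_c, xb_c = (xa_l + xa_r) / 2, (xb_l + xb_r) / 2    # cyclopean x-positions
        sep = ((xa_c - xb_c) ** 2 + (ya - yb) ** 2) ** 0.5
        return abs(da - db) / sep if sep > 0 else float('inf')

    def support_score(candidate, neighbouring_matches, limit=1.0):
        """PMF-style support: the number of neighbouring candidate matches whose
        disparity gradient with respect to this candidate is below the limit."""
        return sum(1 for m in neighbouring_matches
                   if disparity_gradient(candidate, m) < limit)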
Reading for lecture 20
Jain
Section 1.4 describes the pinhole camera model and the perspective projection. Chapter 11 deals with the simple
stereo geometry and disparity, correlation matching and the Moravec interest operator. Section 11.4 provides
some discussion of other “shape from X” topics (X = texture, shading etc.) and range imaging (structured light).
You should be able to understand the principles of these methods at the very superficial level at which they are
presented here. Section 11.5 goes into the topic of active vision, which is interesting but beyond the scope of this
course.
Sonka
Chapter 9 goes into 3D imaging in rather more mathematical detail than is required for our purposes. Imaging
geometry is dealt with from the outset as a 3D vector problem. If you are happy with matrix algebra the
arguments of this lecture and the next are presented rather concisely in that form. Section 9.1 broadens the scope
of 3D vision to incorporate Marr’s proposals for representations in natural vision and the topic of active vision.
Neither of these is necessary, but both are interesting. Sonka can be taken as an interesting, and mathematically
complete, extension of the presentation in Jain and in the lectures, but is not necessary reading for our purposes.