CS3432: Lecture 21 Stereo Vision (part 2)

In the notes to lecture 20 we were introduced to the pinhole camera model and a simple (rather improbable) stereo
geometry. Here we generalise the geometry, and see that stereo reconstruction is still straightforward, but that we
need to calibrate the values of a rather larger number of parameters.
Stereo geometry revisited.
The general stereo geometry is shown in the figure below. The left and right image planes are no longer parallel
and aligned. Scene point P, a vector with scene coordinates (X,Y,Z), projects into image points pl and pr with
coordinates (xl, yl) and (xr, yr) in the coordinate systems of the left and right images respectively. The optical
centres Ol and Or are separated from the image planes by the focal lengths fl and fr. Thus the image points, pl
and pr, can be represented by the three-dimensional coordinates (xl, yl, fl) and (xr, yr, fr). The vector to the scene
point in each of the camera coordinate systems is Pl and Pr respectively. Any point along the vector Pl, say, is
just a scalar multiple of pl; that is, Pl = al pl, where al is a scalar. Similarly, Pr = ar pr.
[Figure: the general stereo geometry. Scene point P(X,Y,Z) projects to pl in the left image plane and pr in the right image plane; the optical centres Ol and Or lie at distances fl and fr from their image planes; each camera has its own axes (xl, yl, zl) and (xr, yr, zr); the translations Tl, Tr and rotations Rl, Rr relate the scene axes X, Y, Z to the camera axes.]
Coordinate transformations
In this more generalised geometry, we need to be much clearer about the coordinate systems being used than we
were in the simple case. The figure shows three coordinate systems: the coordinates of the left and right cameras,
(xl, yl, zl) and (xr, yr, zr), and the scene (or world) coordinates (X,Y,Z). As we can see, the origins of these
coordinate systems do not coincide, and their axes are not parallel. We need the scene coordinate system because
ultimately we want to know the positions in space of points in the scene independently of where we choose to put
the cameras. To do this we need to know the relationships between vectors in the scene coordinate system and
vectors in each of the camera coordinate systems. This relationship is expressed as a translation vector
between the scene origin and the camera origin (Tl and Tr) and a 3D rotation matrix (Rl and Rr). These
transformations can be determined by calibration (see below).
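To make the transformation concrete, here is a minimal numerical sketch of re-expressing a camera-frame vector in scene coordinates as P′ = T + R P. The rotation angle and translation below are illustrative values only, not from any real calibration:

```python
import numpy as np

# A minimal sketch of the camera-to-scene transformation P' = T + R P.
theta = np.radians(10.0)  # rotation of the left camera about the scene Y axis
R_l = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                [ 0.0,           1.0, 0.0          ],
                [-np.sin(theta), 0.0, np.cos(theta)]])
T_l = np.array([-120.0, 0.0, 0.0])  # translation of the camera origin (mm)

# A vector measured in the left camera's coordinate system...
P_cam = np.array([15.0, 7.5, 900.0])

# ...re-expressed in scene (world) coordinates.
P_scene = T_l + R_l @ P_cam
print(P_scene)
```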
Stereo reconstruction
Reconstruction is simply a matter of finding the point of intersection of the vectors Pl and Pr. These, however, are
represented in the left and right camera coordinates; we need the vectors expressed in scene coordinates. We can
write these as:
Pl′ = Tl + Rl Pl and Pr′ = Tr + Rr Pr.
The reconstructed point is the intersection of these two vectors, i.e. the point where
Pl′ = Pr′.
Since we know that Pl = al pl and Pr = ar pr, we can find the point of intersection of the vectors from the
following set of linear equations:
Tl + al Rl pl = Tr + ar Rr pr.
Recall that pl and pr are the vectors (xl, yl, fl) and (xr, yr, fr). We can measure xl, yl, xr and yr. If we know the
coordinate transformations (Tl, Tr, Rl and Rr) and the focal lengths (fl and fr), we can solve this set of linear
equations for al and ar. The next section indicates how we can determine these parameters.
The solution is slightly more complicated in practice. Because of measurement errors and calibration errors, the
projected vectors Pl′ and Pr′ do not, in general, intersect in space. We need to find the mid-point of the shortest
line joining them. This can also be achieved by solving a system of linear equations.
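As a sketch of how this works in practice (the function name and structure are our own, not from any particular library), solving the over-determined system in the least-squares sense automatically yields the closest points on the two rays, and hence the mid-point of the shortest line joining them:

```python
import numpy as np

def reconstruct_point(p_l, p_r, R_l, T_l, R_r, T_r):
    """Triangulate a scene point from a matched pair of image points.

    p_l and p_r are the 3-vectors (xl, yl, fl) and (xr, yr, fr).
    Solves Tl + al Rl pl ~ Tr + ar Rr pr in the least-squares sense,
    then returns the mid-point of the shortest line joining the two
    rays, since in general they do not intersect exactly.
    """
    d_l = R_l @ p_l                    # ray direction from the left camera
    d_r = R_r @ p_r                    # ray direction from the right camera
    A = np.column_stack([d_l, -d_r])   # 3 equations, 2 unknowns (al, ar)
    b = T_r - T_l
    (a_l, a_r), *_ = np.linalg.lstsq(A, b, rcond=None)
    P_l = T_l + a_l * d_l              # closest point on the left ray
    P_r = T_r + a_r * d_r              # closest point on the right ray
    return 0.5 * (P_l + P_r)
```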
Camera Calibration
The above equations relate the scene coordinates of points in space to the image coordinates of the projections of
those points in the left and right images. To solve them we need to know the camera parameters. We can
determine the camera parameters using the same relationships if we know accurate scene coordinates
corresponding to image points. We can classify camera parameters into two types: extrinsic and intrinsic.
Extrinsic parameters
The extrinsic parameters relate the camera coordinate systems to the scene coordinates. They are the translation
vector T (Tx, Ty, Tz) and the rotation matrix R for each camera. R is a 3×3 matrix, but there are three free
parameters (the angles through which the coordinate system rotates).
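To see why the nine entries of a 3×3 rotation matrix hide only three free parameters, here is a sketch that builds R from three rotation angles. This uses one common angle convention among several; others differ in axis order:

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Build a 3x3 rotation matrix from three angles in radians,
    applied about the x, y and z axes in turn."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx  # nine entries, three free parameters
```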
Intrinsic parameters
[Figure: the image plane (red rectangle) with the pixel grid (black) superimposed, showing the pixel dimensions sx and sy and the origin shift (dx, dy) between the image origin and the grid origin.]
The intrinsic parameters are internal properties of the camera. One of these is the focal length f, which we need in
order to solve the reconstruction equations. There are others needed to relate the image coordinates to the pixel
coordinates measured by the detector. The image plane is a geometrical construct determined by the optics; the
sensor is a physical object located on the image plane (we hope). At the very least we need a scale factor to turn
the dimensionless pixel values into a meaningful unit of measurement, such as millimetres (the millimetre spacing
of the elements of the CCD camera). In practice we may need two scale factors, sx and sy, since the pixel spacing
in the two directions may be different. The origin of the image plane and the origin of the sensor would ideally
coincide, but in practice there may be a shift between them, requiring us to know two other parameters, dx and dy,
the offset of the sensor origin relative to the origin of the image plane. These parameters are illustrated in the
figure. The red rectangle represents the image plane. The black pixel grid is superimposed on it. The parameters
f, sx and sy are interdependent. We only need to know f and the ratio sx/sy.
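As a small illustration of how these parameters are used, converting detector pixel coordinates into image-plane coordinates looks something like the sketch below. The numbers are invented, and sign and unit conventions for the origin shift vary between treatments:

```python
# Illustrative intrinsic values; real ones come from calibration.
s_x, s_y = 0.011, 0.013  # pixel spacing in mm (may differ in x and y)
d_x, d_y = 3.2, -1.7     # shift of the sensor origin (here in pixels)

def pixel_to_image(col, row):
    """Convert detector pixel coordinates to image-plane millimetres."""
    return (col - d_x) * s_x, (row - d_y) * s_y
```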
For complete accuracy we need to consider further parameters that estimate the magnitude of distortions
introduced by the lenses. Here we are simply looking at the nature of the problem, so we won’t create further
complexity by considering these.
Calibration
There are a number of algorithms for determining these parameters. All of them need a calibration target – some
carefully manufactured object containing features that can be easily and accurately located in images, and whose
coordinates have been measured accurately relative to its own (scene) origin. The figure shows left and right
images of one form of calibration target. This consists of a set of black squares on a light background. The
positions of the corners of the squares have been measured accurately. These points can be located easily, and to
sub-pixel precision, using (say) a Canny edge detector. Provided the planar target is oriented at a suitable angle to
both cameras, the relationships between the 3D scene coordinates and the image coordinates in the two views can
be used to calculate the extrinsic and intrinsic parameters.
Different calibration algorithms are compromises between the accuracy of calibration required, the complexity of
the algorithm and the constraints on the target. The planar target shown in the figure is easy to manufacture and
measure. The algorithm that makes use of it is fairly straightforward, but there is a limit on the accuracy with
which the parameters can be determined. It turns out that this is accurate enough for most measurement purposes.
If very high accuracy is required, then more complex algorithms are used, involving multi-parameter non-linear
optimisation. These are less robust in terms of their tolerance to variation in target position and the accuracy with
which the target points are known. They also require 3D targets which are more difficult to engineer and measure.
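For orientation, here is a minimal sketch of the calibration step using OpenCV. Note the assumptions: OpenCV's built-in corner finder expects a chessboard pattern rather than the isolated black squares shown in the figure, and the filenames are hypothetical; but the principle, matching known target coordinates against detected image corners, is exactly the one described above:

```python
import cv2
import numpy as np

pattern = (9, 6)   # inner corners per row and column of the target
square = 25.0      # square size in mm, measured on the target

# 3D coordinates of the corners in the target's own (planar) frame.
obj = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for fname in ["left01.png", "left02.png", "left03.png"]:  # hypothetical files
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(obj)
        img_points.append(corners)

# Recovers the intrinsic matrix K (encoding f and the pixel scales),
# the lens distortion coefficients, and extrinsic R and T per view.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```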
Epipolar Geometry
In our simple stereo geometry we noted that corresponding points lie on the same horizontal line (the same y value) in the left and right
images. We called these epipolar lines. In our more general camera geometry, the epipolar lines are no longer
parallel to the image x-axes. The figure below shows the camera geometry again. This time the plane connecting
the scene point P to the optic centres of the two images, Ol and Or, has been shaded. This is the epipolar plane.
Clearly the two projections of P (pl and pr) both lie on this plane on the lines where it intersects the image planes.
These lines are the epipolar lines in this geometry.
The line connecting the optic centres, Ol and Or, intersects the left and right image planes at the epipoles (el and
er). This intersection may not be within the field of view (the image planes are geometrical constructs, infinite in
extent). The image planes are shown extended in the figure. The left epipole is the projection of the optic centre
of the right camera in the left camera’s image plane and vice versa. Notice that the line connecting Ol and Or is
always part of the epipolar plane for every scene point. That is, all epipolar lines in a given image pass through
the epipole.
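A small numerical check makes the geometry concrete (the positions below are made up): the optical centres and the scene point define the epipolar plane, and every point along either projection ray must lie in that plane.

```python
import numpy as np

# Made-up positions, expressed in scene coordinates.
O_l = np.array([-100.0, 0.0, 0.0])     # left optical centre
O_r = np.array([ 100.0, 0.0, 0.0])     # right optical centre
P   = np.array([  30.0, 40.0, 800.0])  # a scene point

# Normal to the epipolar plane through O_l, O_r and P.
n = np.cross(O_r - O_l, P - O_l)

# Every point on the left projection ray O_l + a (P - O_l) satisfies
# the plane equation n . (X - O_l) = 0; likewise for the right ray.
for a in (0.1, 0.5, 2.0):
    X = O_l + a * (P - O_l)
    assert abs(n @ (X - O_l)) < 1e-6
```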
Search for correspondences is still one-dimensional along the epipolar lines. For edge search it is not too much of
a problem that the epipolar lines are tilted with respect to the image axes. For correlation-based correspondence it
may be useful to rectify the images. Rectification performs a warping of one or both of the images to have the
effect of rotating the image planes to be parallel to each other, putting the epipoles at infinity and generating
epipolar lines parallel to the image x-axis. This makes search rather more convenient.
[Figure: the epipolar geometry. The shaded epipolar plane through the scene point P(X,Y,Z) and the optical centres Ol and Or cuts the image planes along the epipolar lines; the line through Ol and Or meets the image planes at the epipoles el and er.]
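In practice, rectification is usually done with library support. The sketch below uses OpenCV's stereoRectify, assuming calibration results are already available; the calibration values and input image shown here are illustrative placeholders:

```python
import cv2
import numpy as np

# Illustrative calibration results; in a real system these come from
# the calibration step described earlier.
K_l = K_r = np.array([[800.0,   0.0, 320.0],
                      [  0.0, 800.0, 240.0],
                      [  0.0,   0.0,   1.0]])
dist_l = dist_r = np.zeros(5)        # assume negligible lens distortion
R = np.eye(3)                        # rotation between the two cameras
T = np.array([-120.0, 0.0, 0.0])     # translation between them (mm)
image_size = (640, 480)

# Rotations R1, R2 and projections P1, P2 that make the image planes
# parallel and the epipolar lines horizontal.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K_l, dist_l, K_r, dist_r, image_size, R, T)

# Resample the left image according to the rectifying transformation.
left_image = cv2.imread("left.png")  # hypothetical input image
map_x, map_y = cv2.initUndistortRectifyMap(
    K_l, dist_l, R1, P1, image_size, cv2.CV_32FC1)
left_rectified = cv2.remap(left_image, map_x, map_y, cv2.INTER_LINEAR)
```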
A Stereo Inspection Example
The figures below (a,b) show stereo images of a 3D part (a component of a motor-car steering column). If we
wish to know the three dimensional orientation (pose) of this object, we could match a 3D model of its expected
appearance to the 3D positions of features in the scene. The model in this case consists of a set of circles and lines
in 3D corresponding to the important features of the object. The second pair of images (c,d) shows edges detected
in the stereo images with some epipolar lines superimposed. The epipolar lines are close to parallel to the x-axis
because the two cameras are situated close together looking at the object some distance away. The next pair of
images shows the 3D edge positions determined by stereo reconstruction (e) and some features (lines and circles)
extracted from these edges (f) to reduce the amount of uninformative data in the 3D scene. The last pair of pictures (g,h)
shows the projection of the 3D model onto the original 2D images after the pose has been refined.
[Figure: (a,b) the stereo image pair; (c,d) detected edges with epipolar lines superimposed; (e) reconstructed 3D edges; (f) extracted line and circle features; (g,h) the 3D model projected onto the original images after pose refinement.]
Trinocular Stereo
There is no need to be limited to two cameras. Indeed, the correspondence problem can be made more tractable
by using three views (trinocular stereo). In this case there are three sets of epipolar lines. A point in image 1
corresponds to lines in images 2 and 3. Each of those lines projects to a point in the other images, so that
correspondence matching could, in principle, be determined entirely from epipolar geometry. In practice, where
trinocular stereo is used, the third epipolar line is used to verify matches made by correspondence.
[Figure: three camera views and their epipolar lines.]
A disadvantage of trinocular stereo is that we have three cameras to calibrate.
Uncalibrated stereo
If our final aim is not to reconstruct the 3D scene with metric accuracy, i.e. to determine absolute values of scene
coordinates, we can use uncalibrated stereo cameras. This means calculating the calibration parameters up to a
scale factor from the positions of scene points. By calculating up to a scale factor we mean determining the
relative positions of points but not the absolute dimensions. This is a familiar enough idea. Our own visual
system, deprived of sufficient cues to estimate absolute depth, may not be able to tell the difference between a
large object far away and a small object nearby, but we would be able to determine the 3D shape.
How this is done takes us outside the scope of this course. In essence it involves making the scene coordinate
system coincide with one of the camera coordinate systems (we might as well since we are not interested in
absolute positions). This halves the number of extrinsic parameters to be determined. It can be shown that if we
can reliably match eight points (at least) in each of the images, we can recover enough parameters directly from
the scene itself to calculate relative 3D positions in the scene. The eight points need to avoid unlucky
“degenerate” configurations, such as being coplanar. For many purposes this may be sufficient, particularly if
other cues are available to estimate the scale factor.
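One well-known way of doing this, beyond the scope of these notes but worth seeing, is the classical eight-point algorithm: each of the eight correspondences gives one linear equation in the nine entries of the fundamental matrix F (the matrix encoding the epipolar geometry), which is therefore determined up to scale. A minimal sketch of the linear step, with homogeneous pixel coordinates assumed:

```python
import numpy as np

def fundamental_from_matches(pts_l, pts_r):
    """Linear step of the classical eight-point algorithm (a sketch).

    Each correspondence (x, y) <-> (x', y') gives one equation in the
    nine entries of F via the epipolar constraint p_r^T F p_l = 0.
    With eight or more non-degenerate matches, F is determined up to
    an overall scale factor.
    """
    A = np.array([[xp*x, xp*y, xp, yp*x, yp*y, yp, x, y, 1.0]
                  for (x, y), (xp, yp) in zip(pts_l, pts_r)])
    # The right singular vector with the smallest singular value is
    # the best null vector of A, i.e. F up to scale.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```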
Reading for lecture 21
Jain
Chapter 12 deals with calibration and the rotation matrices in rather more detail than we require.
Section 12.6 gives an algorithm for depth reconstruction.
Sonka
Chapter 9 goes into 3D imaging in rather more mathematical detail than is required for our purposes. Imaging
geometry is dealt with from the outset as a 3D vector problem. If you are happy with matrix algebra the
arguments of this lecture and the next are presented rather concisely in that form. Section 9.1 broadens the scope
of 3D vision to incorporate Marr’s proposals for representations in natural vision and the topic of active vision.
Neither of these is necessary, but both are interesting. Sonka can be taken as an interesting, and mathematically
complete, extension of the presentation in Jain and in the lectures, but is not necessary reading for our purposes.