Multi-view Stereo Image Synthesis from Two-view Stereo Images
Yan-Hsiang Huang
Yu-Hsiang Huang
Ming-Hung Tsai
National Taiwan University
{princeyan, edwardhw, allcyril}@cmlab.csie.ntu.edu.tw
Abstract
Nowadays, stereo displays let viewers experience 3D effects without wearing special glasses. These displays are called autostereoscopic displays or glasses-free 3D displays. However, with only two views, viewers have a limited stereo experience, since there is no motion parallax. Multi-view autostereoscopic displays were developed to solve this problem. These displays show a number of views from different viewing angles at once; with an optical or physical barrier, a viewer sees only the content of a certain viewing angle and can therefore perceive motion parallax.
To make use of this technology, we need multi-view content to display. Unfortunately, such content is hard to obtain; even the stereo cameras available today capture only two views at once. The main goal of our project is therefore to create multi-view content from two-view content captured by stereo cameras. We also exploit the computing power of GPUs to accelerate the process.
1. Introduction
In general, even given the two-view stereo photo we take, we still need depth information to recover images from different viewing angles. As long as we have the depth of the whole scene, we can compute the disparity between the original photo and the synthesized image. However, depth information is difficult to obtain: it usually requires human input for segmentation and depth assignment, so the process cannot be done automatically. Even when depth information is available, occlusion remains a big issue.
We propose a method that uses warping to synthesize images from different viewing angles. This technique requires no depth map, so it is free from the occlusion problem. We extract matched feature points between the two-view stereo photos in order to estimate where those feature points would appear from a different viewing angle; the warping is guided by these feature points. Nevertheless, to avoid distorting the content of the original photo, we must add additional constraints. For example, we can deform the image into a quad mesh and warp the grid points according to the estimated feature locations while simultaneously preserving the content, which is achieved by designing an energy function and minimizing it by solving a linear system. Although there are some artifacts in occluded regions, the overall result quality is acceptable. The main drawback of this approach, however, is the high computation cost of the content-preserving warp.
Instead of warping the grid points of a quad mesh, which requires solving a time-consuming linear system, we would like to skip the content-preserving warp and move each feature point exactly to the location we estimated. To achieve this, we use a triangle mesh: by applying Delaunay triangulation on the feature points, we can move the feature points and change the triangle mesh, and finally just do texture mapping to obtain the synthesized image. Nonetheless, the content-preservation problem appears again. Intuitively, the smaller the triangles we use, the less distortion we introduce, so we use dense rather than sparse features to generate more small triangles. In this case, most of the time is spent extracting dense features, so we use the GPU to accelerate this part.
2. System Overview
Figure 1. The flowchart of converting a two-view stereo image to a multi-view stereo image: input two-view stereo image → feature extraction and matching → compute feature coordinates for each view → triangulation by features → texture mapping for each view → output multi-view stereo image.
We first extract the feature points of the two images, which we call the left-eye image and the right-eye image. To preserve the content of the original two-view stereo image, we not only apply the dense interest point algorithm mentioned before but also use features on the edges. We therefore obtain a tremendous number of features, and finding the matching relation among them takes a lot of time; this is discussed later. Second, with these matched feature points, we estimate the location of each feature point in the other views. A simple and effective way is to interpolate and extrapolate the matched feature points. Third, instead of warping a quad mesh, we deform the two images into triangle meshes whose vertices are the feature points. Finally, we move each triangle, as a texture, onto the virtual view image according to the estimated feature point coordinates.
3. Implementation
Based on the concept of Figure 1, the implementation details are explained below.
3.1 Sparse Stereo Correspondences
Figure 2. First row: feature extraction and matching results using (a) SURF, (b) dense interest points (DIP), and (c) DIP + Canny + LSD. Second row: (d)-(f) the corresponding triangulation results. The method we use (DIP + Canny + LSD) yields good matches in both textured and texture-less regions, and it cuts out the object contours well.
To synthesize a virtual view by warping, we need stereo correspondences to guide the warp. The most popular approach for stereo view synthesis is to compute dense stereo correspondence, such as a depth map or disparity map. In our project, we need only sparse stereo correspondence, i.e., feature correspondences between the input images.

Standard feature extraction methods such as SIFT or SURF find good features, but they find few features in texture-less regions, which causes serious artifacts in our work: those regions become badly distorted or are simply left blank. Therefore, we apply the dense interest point (DIP) algorithm, which combines the advantages of uniform sampling and standard feature extraction; that is, it finds features more uniformly over the image while maintaining the quality of the extracted features.
However, we found that with dense interest points alone, the results are not good enough because of distortion and discontinuity near edges and object contours. We solve this problem by finding additional features along the contours. We first apply the LSD algorithm to find good edge segments and sample several points on each segment. Next, we enhance the result with more features sampled from the edge map computed by the Canny edge detector. Figure 2 shows the feature points we extract.
Feature Extraction
For edge features, the LSD algorithm gives us the two endpoints of each line segment, but we still need features along the segment itself. If we took every point on a segment as a feature, the triangulation would deform the image cleanly along the border of every region and the texture-mapped output would look excellent; unfortunately, extracting descriptors for and matching that many features takes far too long. We therefore sample points on each segment as line features. In our implementation, if the distance between the two endpoints of a segment is larger than 5 pixels, we pick the midpoint and recursively check the midpoint against each of the two endpoints. The goal is to sample feature points on the edge uniformly and not too close to each other.

We also use Canny edge detection to find possible edges in the image, and again sample the edge points: we keep an edge point only if the sum of its x and y coordinates is a multiple of 5. This avoids points that are too close to each other and reduces the number of edge features.
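To make the two sampling rules concrete, here is a minimal Python sketch; the function names and signatures are ours for illustration, not from the original implementation.

```python
import numpy as np

def sample_segment(p0, p1, min_dist=5.0):
    """Recursively sample an LSD segment: if the endpoints are farther
    apart than min_dist, keep the midpoint and recurse on both halves."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    if np.linalg.norm(p1 - p0) <= min_dist:
        return []
    mid = (p0 + p1) / 2.0
    return (sample_segment(p0, mid, min_dist)
            + [tuple(mid)]
            + sample_segment(mid, p1, min_dist))

def sample_canny_edges(edge_map, step=5):
    """Keep a Canny edge pixel only if (x + y) is a multiple of `step`,
    thinning the edge map roughly uniformly."""
    ys, xs = np.nonzero(edge_map)
    return [(int(x), int(y)) for x, y in zip(xs, ys) if (x + y) % step == 0]
```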
Figure 3. The matched features: red marks dense interest points, green marks features sampled from Canny edge detection, and blue marks features sampled from LSD.
Descriptor Extraction
We adopt the SURF descriptor because of its outstanding performance in many respects. In terms of matching quality, the 128-dimensional SIFT descriptor is larger than the 64-dimensional SURF descriptor, yet there is no obvious difference between their results. Since we have a giant number of interest points and descriptor extraction is the bottleneck of our program, we compute the SURF descriptors in parallel on the GPU.
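For reference, computing SURF descriptors at pre-detected locations can be done as in the following CPU sketch using OpenCV's contrib module; our actual implementation performs this step on the GPU, and the keypoint size below is an assumed value.

```python
import cv2

def extract_surf_descriptors(gray, points, patch_size=16.0):
    """Compute 64-dim SURF descriptors at the DIP/Canny/LSD feature
    locations instead of letting SURF detect its own keypoints."""
    surf = cv2.xfeatures2d.SURF_create(extended=False)  # 64-dim variant
    keypoints = [cv2.KeyPoint(float(x), float(y), patch_size) for x, y in points]
    keypoints, descriptors = surf.compute(gray, keypoints)
    return keypoints, descriptors
```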
Feature Matching
This part requires a huge amount of computation, so we use a kd-tree to speed up the matching. There is no particular geometric model relating the two views, so we cannot apply the RANSAC algorithm to separate inliers from outliers. What we can do is check the y coordinates of the two matched points and remove matches whose y coordinates differ too much. Figure 3 shows the matching result.
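A kd-tree matcher with the y-coordinate check might look like the sketch below; the FLANN parameters and the 10-pixel threshold are illustrative assumptions, since the report does not give exact values.

```python
import cv2
import numpy as np

FLANN_INDEX_KDTREE = 1

def match_features(desc_l, desc_r, kps_l, kps_r, max_dy=10.0):
    """Match SURF descriptors with a FLANN kd-tree, then drop matches
    whose two features differ too much in their y coordinates."""
    flann = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=4),
                                  dict(checks=32))
    matches = flann.match(np.float32(desc_l), np.float32(desc_r))
    kept = []
    for m in matches:
        (xl, yl) = kps_l[m.queryIdx].pt
        (xr, yr) = kps_r[m.trainIdx].pt
        if abs(yl - yr) <= max_dy:
            kept.append(((xl, yl), (xr, yr)))
    return kept
```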
Outlier Filtering
We filter outliers in two parts. First, since the features we extract are not as distinctive as those found by standard feature detection algorithms, the matching result may contain many outliers, which can cause serious artifacts such as distortion or flipping. We therefore use an outlier filtering method whose main idea is to treat feature pairs that differ too much from their neighbors as outliers. Second, the matching pairs should follow a certain spatial order, so that flipping artifacts can be reduced.
First, we filter out outlier matches with the following constraint:

|(y′ − y) / (x′ − x)| > M,

where (x, y) is the feature on the left-eye image of a matching pair and (x′, y′) is the corresponding feature on the right-eye image. For our test cases we typically use M = 0.25. This removes the obvious outliers.

Next, over all matched feature pairs we compute the mean and the standard deviation stdDev of the distance between the two features of a match. We then filter the remaining pairs with the following constraint; pairs that satisfy it are deemed outliers:

‖(x′, y′) − (x, y)‖ > mean + t · stdDev   or   ‖(x′, y′) − (x, y)‖ < mean − t · stdDev,

where (x, y) and (x′, y′) are again the features on the left-eye and right-eye images of a pair. For our test cases we typically use t = 1.5.

Having widened the gap between inliers and outliers, it becomes easier to check the geometric relationship among the surviving matches. We use the sequence order as a constraint, examining each feature point and removing those whose order differs from that of their neighbors. In our implementation, we take a particular inlier match, A and A′, and randomly pick another inlier match, B and B′. If the left/right and upper/lower relative positions of A and B are the same as those of A′ and B′, we check the next inlier match; otherwise, we filter this match out. Checking all inlier matches constitutes one round, and we repeat the round thirty times.
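The two statistical filters can be summarized in a few lines; below is a compact sketch assuming matches are stored as pairs of (x, y) tuples. The defaults mirror the thresholds above (M = 0.25, t = 1.5), and the order-constraint rounds are omitted for brevity.

```python
import numpy as np

def filter_matches(pairs, M=0.25, t=1.5):
    """Two-stage outlier filtering on matched pairs ((x, y), (x2, y2)).
    Stage 1: drop pairs whose match vector has slope |dy/dx| > M.
    Stage 2: drop pairs whose match distance deviates from the mean
    distance by more than t standard deviations."""
    stage1 = [((x, y), (x2, y2)) for (x, y), (x2, y2) in pairs
              if x2 != x and abs((y2 - y) / (x2 - x)) <= M]
    dists = np.array([np.hypot(x2 - x, y2 - y)
                      for (x, y), (x2, y2) in stage1])
    mean, std = dists.mean(), dists.std()
    return [p for p, d in zip(stage1, dists)
            if mean - t * std <= d <= mean + t * std]
```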
3.2 Delaunay Triangulation
We usually deform an image into a quad mesh before warping. The quad mesh has many advantages: its structure is regular, and its predictability makes it easy to arrange and maintain. At first we therefore did not even consider any polygon mesh other than the quad mesh. However, we discovered that even though warping avoids solving the occlusion problem, it cannot separate regions, which causes uncomfortable distortion around seriously occluded regions. It is important to cut out each region before warping; the depth map used in DIBR (depth-image-based rendering) is likewise region-based. What we can improve is to distinguish the different regions and impose corresponding constraints. Since we need to break regions apart, there is no reason to preserve the quad structure.

Consequently, we decided to use a triangle mesh. Given points on a 2D image, we triangulate the image based on these given points, so that the vertices of the triangles are exactly the given points. By applying Delaunay triangulation to the tremendous number of edge features, the triangles separate the regions clearly. In our implementation, we run Delaunay triangulation on the left-eye image and the right-eye image independently, so there is no relation between the two triangle meshes. Figure 2 illustrates the triangulation result.
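As a concrete illustration, here is a minimal sketch using SciPy's Delaunay triangulation; the report does not state which triangulation library was actually used, so the choice here is an assumption.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate(points):
    """Delaunay-triangulate one view's feature points. Returns an
    (n_triangles, 3) array of vertex indices into `points`."""
    return Delaunay(np.asarray(points, dtype=np.float64)).simplices

# Each view is triangulated independently, so the two meshes differ:
# tri_left = triangulate(left_points)
# tri_right = triangulate(right_points)
```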
3.3 Interpolation and Extrapolation
A glasses-free 3D display needs multi-view content. In our implementation we target an eight-view display, so we need to synthesize six virtual views from the original two-view stereo image. The virtual views are arranged as shown in Figure 4.

Figure 4. The eight views, numbered 1 to 8 along the baseline. The red circles (views 3 and 6) are the original two-view images, and the purple circles are the six virtual view images.
The eight views are arranged on a straight line, and the distance between adjacent views is constant. We can exploit this arrangement and simply use interpolation and extrapolation to obtain the feature point coordinates in the virtual views. This may seem like a rough estimate, but we have checked some ground-truth stereo images available on the web, and the disparities between adjacent views are almost identical. It is therefore appropriate to use interpolation and extrapolation to estimate the feature point locations.
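Under the evenly spaced arrangement of Figure 4, the estimate reduces to linear interpolation or extrapolation along the baseline. The sketch below assumes the input images occupy views 3 and 6, as in the arrangement above.

```python
import numpy as np

def virtual_view_coords(pt_left, pt_right, view, left_view=3, right_view=6):
    """Linearly interpolate (views 4-5) or extrapolate (views 1-2, 7-8) a
    matched feature's position for the given view index, assuming the
    eight views are evenly spaced along a straight baseline."""
    p_l = np.asarray(pt_left, dtype=float)   # feature in left-eye image (view 3)
    p_r = np.asarray(pt_right, dtype=float)  # feature in right-eye image (view 6)
    alpha = (view - left_view) / (right_view - left_view)
    return (1.0 - alpha) * p_l + alpha * p_r
```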
3.4 Texture Mapping
After triangulating the two images, each triangle can be stored as a texture. There are two sets of textures because the triangle meshes of the left-eye and right-eye images differ. Since the coordinates of all feature points in the virtual views are known from the previous step, we can warp the image by pasting each texture triangle at its corresponding location. In our implementation, virtual views 1, 2, and 4 are synthesized from the left-eye image, which is view 3, while virtual views 5, 7, and 8 are synthesized from the right-eye image, which is view 6. That is to say, the four left views share the same triangle mesh, while the four right views share the other.
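Pasting one triangle amounts to fitting the affine transform between the source and destination vertex triples and copying the pixels inside the triangle. The routine below is a common OpenCV pattern shown as a hedged sketch, not the report's actual code; it ignores triangles that fall partly outside the image.

```python
import cv2
import numpy as np

def warp_triangle(src, dst, tri_src, tri_dst):
    """Paste one texture triangle from `src` into `dst` using the affine
    map between the source and destination vertex triples."""
    r_src = cv2.boundingRect(np.float32([tri_src]))
    r_dst = cv2.boundingRect(np.float32([tri_dst]))
    # Vertex coordinates relative to each bounding rectangle.
    t_src = np.float32([(x - r_src[0], y - r_src[1]) for x, y in tri_src])
    t_dst = np.float32([(x - r_dst[0], y - r_dst[1]) for x, y in tri_dst])
    patch = src[r_src[1]:r_src[1]+r_src[3], r_src[0]:r_src[0]+r_src[2]]
    M = cv2.getAffineTransform(t_src, t_dst)
    warped = cv2.warpAffine(patch, M, (r_dst[2], r_dst[3]))
    # Copy only the pixels inside the destination triangle.
    mask = np.zeros((r_dst[3], r_dst[2]), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(t_dst), 1)
    roi = dst[r_dst[1]:r_dst[1]+r_dst[3], r_dst[0]:r_dst[0]+r_dst[2]]
    roi[mask > 0] = warped[mask > 0]
```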
4. Result
Figure 5 shows some results of our eight-view images.
5. Conclusion
In this project, we present a system that synthesizes virtual views along the baseline of an input two-view stereo image pair. Unlike the majority of today's view synthesis algorithms, ours is a warp-based algorithm that warps the virtual view guided by the feature correspondences between the input image pair. It therefore needs no depth information, which is hard to compute accurately and time-consuming to obtain.

Since we apply a dense feature extraction algorithm and extract additional features on the edges, the matched feature pairs are distributed quite uniformly over the image, with many of them lying on edges. With this method, the triangulation cuts out the object contours well, so the results are less distorted than those of warp-based methods that use uniform quad grids.
At the time of writing, our system is GPU-accelerated only in the feature descriptor extraction and matching parts. Further acceleration is possible, for instance in the outlier filtering and triangulation parts; we leave these as future work since they are not as time-consuming as extraction and matching.

The SURF descriptor we use is more expensive than necessary in our case, since the input two-view stereo images differ little from each other. Possible alternatives are simplified descriptors or learning-based descriptors that take the input image pair as training data. For the triangulation part, constrained Delaunay triangulation combined with LSD and a line matching algorithm could cut out the object edges better, and thus may yield better results.

References
[1] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), 110(3):346-359, 2008.
[2] Canny, J. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[3] Grompone von Gioi, R., Jakubowicz, J., Morel, J.-M., and Randall, G. LSD: A Fast Line Segment Detector with a False Detection Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):722-732, 2010.
[4] Tuytelaars, T. Dense Interest Points. In CVPR, 2010.
Figure 5. Five sets of eight-view images: (a) Balls, (b) Dino, (c) Dog, (d) Fighter, (e) Ship. The viewing angle goes from left to right.