Multi-view Stereo Image Synthesis from Two-view Stereo Images

Yan-Hsiang Huang, Yu-Hsiang Huang, Ming-Hung Tsai
National Taiwan University
{princeyan, edwardhw, allcyril}@cmlab.csie.ntu.edu.tw

Abstract

Stereo displays can now let the viewer experience 3D effects without wearing special glasses. These displays are called autostereoscopic displays or glasses-free 3D displays. However, with only two views, viewers have a limited stereo experience because there is no motion parallax. Multi-view autostereoscopic displays were developed to solve this problem. Such displays show a number of views from different viewing angles at once. With an optical or physical barrier, a viewer sees only the content for a certain viewing angle and can therefore perceive motion parallax. To make use of this technology, we need multi-view content to display. Unfortunately, such content is hard to obtain: even the stereo cameras available today capture only two views at once. Therefore, the main goal of our project is to create multi-view content from two-view content captured with stereo cameras. We also use the computing power of GPUs to accelerate the process.

1. Introduction

In general, even when a two-view stereo photo is available, depth information is still needed to recover images from other viewing angles. Once we have the depth of the whole scene, we can compute the disparity between the original photo and the synthesized image. However, depth information is difficult to obtain. It usually requires human input for segmentation and depth assignment, so it cannot be done automatically. Even with depth information, occlusion remains a major issue. We propose a method that uses warping to synthesize images from different viewing angles. This technique requires no depth map, so it does not have to solve the occlusion problem explicitly.
We extract matched feature points between the two views of the stereo photo in order to estimate where those feature points would appear from other viewing angles. The warping is guided by these feature points. To avoid distorting the content of the original photo, however, additional constraints are needed. For example, we divide the image into a quad mesh and warp the grid points according to the estimated feature locations while preserving the content. We achieve this by designing an energy function and minimizing it by solving a linear system. Although there are some artifacts in the occluded regions, the overall result quality is acceptable. The main drawback of this approach is the high computation cost of the content-preserving warp.

In order to move the feature points to the desired locations, we warp the grid points of the quad mesh that represents the image. Since solving a linear system is time-consuming, we would like to skip the content-preserving warp. It is more direct to let each feature point move to exactly the location we estimated. To achieve this, we use a triangle mesh instead. By applying Delaunay triangulation to the feature points, we can move the feature points and change the triangle mesh accordingly. Finally, we apply texture mapping to obtain the synthesized image. The problem of content preservation remains, however. Intuitively, the smaller the triangles, the less distortion is introduced. Therefore, we use dense features rather than sparse features to generate more, smaller triangles. In this case most of the time is spent extracting dense features, so we use the GPU to accelerate this part.

2. System Overview

Figure 1. The flowchart of converting a two-view stereo image to a multi-view stereo image: Input Two-view Stereo Image -> Feature Extraction and Matching -> Compute Feature Coordinates for Each View -> Triangulation by Features -> Texture Mapping for Each View -> Output Multi-view Stereo Image.

We first extract the feature points of the two images, which we call the left-eye image and the right-eye image. To preserve the content of the original two-view stereo image, we not only apply the dense interest point algorithm mentioned before, but also use features sampled on edges. As a result, we obtain a very large number of features, and finding the matches among them takes a long time. This is discussed later. Second, with these matched feature points, we estimate the location of each feature point in the other views. A simple and useful way to do this is interpolation and extrapolation on the matched feature points. Third, instead of warping, we divide each of the two images into a triangle mesh whose vertices are the feature points. Finally, we move each triangle as a texture onto the virtual view image according to the estimated feature point coordinates.

3. Implementation

Based on the concept in Figure 1, the implementation details are explained below.

3.1 Sparse Stereo Correspondences

Figure 2. Panels: (a) SURF, (b) dense interest points (DIP), (c) DIP, Canny, LSD, (d) the triangle mesh of (a), (e) the triangle mesh of (b), (f) the triangle mesh of (c). The first row is the feature extraction and matching result using different features. The second row is the result of triangulation. The method we use (DIP + Canny + LSD) yields good matches in both textured and texture-less regions. It also cuts out the object contours well.

To synthesize a virtual view by warping, we need stereo correspondences to guide the warping. The most popular approach to stereo view synthesis is to compute dense stereo correspondence, such as a depth map or a disparity map. In our project, we need only sparse stereo correspondence, that is, feature correspondence between the input images.
Standard feature extraction methods such as SIFT or SURF find good features. However, such methods find few features in texture-less regions, which causes serious artifacts in our work because those regions are severely distorted or simply left blank. Therefore, we apply a dense interest point (DIP) algorithm, which combines the advantages of uniform sampling and standard feature extraction. That is, it finds features more uniformly over the image while maintaining the quality of the extracted features. However, we found that with dense interest points alone the results are not good enough, because of distortion and discontinuity near edges and object contours. We solve this problem by also finding features along the contours. We first apply the LSD algorithm to find good edge segments and sample several points on each segment. Next, we add more features by sampling on the edge map computed by the Canny edge detection algorithm. Figure 2 shows the feature points we extract.

Feature extraction

For edge features, the LSD algorithm gives us the two endpoints of each line segment, but we still need other features along the line. If we took every point on the line as a feature, the triangulation would follow the border of every region and the output of texture mapping could be very good. Unfortunately, extracting descriptors for and matching that many features takes a lot of time. Therefore, we adopt a sampling scheme and take only some points on the line as line features. In our implementation, if the distance between the two endpoints of a line is larger than 5, we pick the midpoint and then recursively check the midpoint against each of the two endpoints. Our goal is to sample feature points on the edge uniformly and not too close to each other. We also use Canny edge detection to find candidate edges in the image, and we obtain edge features from them by sampling as well.
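The recursive midpoint rule for LSD segments can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in this project: the names `sample_segment` and `segment_features` and the `min_len` parameter are hypothetical, the threshold of 5 is taken from the text, and the coordinate unit is assumed to be pixels.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sample_segment(p: Point, q: Point, min_len: float = 5.0) -> List[Point]:
    """Return the points strictly between p and q chosen by recursive bisection.

    A segment is split at its midpoint only while its length is larger
    than min_len, so neighbouring samples never get arbitrarily close.
    """
    if math.hypot(q[0] - p[0], q[1] - p[1]) <= min_len:
        return []
    mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    # Recurse on both halves; keep the samples ordered from p to q.
    return sample_segment(p, mid, min_len) + [mid] + sample_segment(mid, q, min_len)


def segment_features(p: Point, q: Point, min_len: float = 5.0) -> List[Point]:
    """Endpoints of a detected segment plus the recursively sampled points."""
    return [p] + sample_segment(p, q, min_len) + [q]


if __name__ == "__main__":
    for point in segment_features((0, 0), (16, 0)):
        print(point)
```

With this rule, the spacing between neighbouring samples on a segment is at most `min_len` and more than half of it, unless the segment itself is shorter than that, which matches the stated goal of sampling uniformly without placing points too close together.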
For each point on the Canny edge map, we keep the point if the sum of its x coordinate and y coordinate is a multiple of 5. This avoids selecting points that are too close to each other and reduces the number of edge features.

Figure 3. The matched features. Red marks dense interest points, green marks features sampled from Canny edge detection, and blue marks features sampled from LSD.

Descriptor Extraction

We adopt the SURF descriptor because of its strong performance in many respects. Regarding matching, the 128-dimensional SIFT descriptor is longer than the 64-dimensional SURF descriptor, but there is no obvious difference between the results of the two. Since we have a very large number of interest points and descriptor extraction is the bottleneck of our program, we compute the SURF descriptors in parallel on the GPU.

Feature Matching

This part requires a large amount of computation, so we use a k-d tree to speed up the matching. There is no particular model relating the two views, so we cannot apply the RANSAC algorithm to separate inliers from outliers. What we can do is check the y coordinates of the two matched points and remove matches whose y coordinates differ too much. Figure 3 shows the matching result.

Outlier Filtering

Outlier filtering is done in two parts. First, since the features we extract are not as distinctive as those extracted by standard feature detection algorithms, the matching result may contain many outliers. Those outliers can cause serious artifacts such as distortion or flipping. Therefore, we designed an outlier filtering method to remove them. The main idea is to treat feature pairs that differ too much from their neighbors as outliers. In addition, the matching pairs should follow a consistent order so that flipping artifacts are reduced.

We first filter out obvious outliers: a match is removed if

$$\left|\frac{y' - y}{x' - x}\right| > M,$$

where $(x, y)$ is the feature on the left-eye image of a matching pair and $(x', y')$ is the feature on the right-eye image of the same matching pair. For our test cases, we typically use 0.25 for $M$. This removes the obvious outliers. Over the remaining matched feature pairs, we compute the mean $\mathit{mean}$ and the standard deviation $\mathit{stdDev}$ of the distance between the two feature points of a match. We then filter the remaining matching pairs: a pair that satisfies either of the following inequalities is deemed an outlier,

$$\begin{cases} \|(x', y') - (x, y)\| > \mathit{mean} + t \cdot \mathit{stdDev}, \\ \|(x', y') - (x, y)\| < \mathit{mean} - t \cdot \mathit{stdDev}, \end{cases}$$

where $(x, y)$ and $(x', y')$ are defined as above. For our test cases, we typically use 1.5 for $t$.

Second, now that the first part has greatly increased the proportion of inliers, it is easier to check the geometric relationship among the remaining matches. We use the sequence order as a constraint: we examine each feature point and remove features whose order differs from that of their neighbors. In our implementation, we take a particular inlier match, A and A', and randomly pick another inlier match, B and B'. If the left-right relation and the upper-lower relation between A and B are the same as those between A' and B', we move on to the next inlier match. Otherwise, we filter out this inlier match. Checking all inlier matches once forms one round, and we repeat the round thirty times.

3.2 Delaunay Triangulation

An image is usually divided into a quad mesh before warping. The quad mesh has many advantages. For example, the whole structure is regular, and because it is predictable it is easy to manipulate and maintain. At first we therefore did not even consider using any polygon mesh other than the quad mesh. However, we found that although warping avoids solving the occlusion problem, its limitation is that it cannot separate regions, which causes uncomfortable distortion around heavily occluded regions. It is important to cut out each region and warp it separately. The depth map used for DIBR (depth-image-based rendering) is also region based. What we can improve is to distinguish different regions and impose constraints on them. Because we need to break up regions, it makes no sense to preserve the quad structure. Consequently, we decided to use a triangle mesh. Given points on a 2D image, we want to triangulate the image based on those points; that is, the vertices of the triangles are exactly the given points. By applying Delaunay triangulation to the large number of edge features, the triangles separate the regions clearly. In our implementation, we perform Delaunay triangulation on the left-eye image and the right-eye image independently, so there is no relation between the two triangle meshes. Figure 2 illustrates the triangulation result.

3.3 Interpolation and Extrapolation

A glasses-free 3D display needs multi-view content. In our implementation, we use an eight-view display, so we need to synthesize six virtual views from the original two-view stereo image. The views are arranged as in Figure 4.

Figure 4. Views 1 to 8. The red circles are the two original views, and the purple circles are the six virtual views.

The eight views are arranged in a straight line, and the distance between any two adjacent views is the same. We can exploit this arrangement and simply use interpolation and extrapolation to obtain the feature point coordinates in the virtual views. This may seem to be a rough estimate, but we checked some ground-truth stereo images available online, and the disparity between each pair of adjacent views is almost the same. It is therefore appropriate to use interpolation and extrapolation to estimate the locations of the feature points.
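The interpolation and extrapolation step can be illustrated with a short sketch. It assumes, as stated in Section 3.4, that the left-eye image is view 3 and the right-eye image is view 6, and that the views are equally spaced on a line; the function name `feature_in_views` and its parameters are hypothetical and chosen only for this example.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def feature_in_views(left: Point, right: Point,
                     left_view: int = 3, right_view: int = 6,
                     num_views: int = 8) -> Dict[int, Point]:
    """Estimate one matched feature's position in every view.

    left and right are the coordinates of the same feature in the
    left-eye and right-eye images. Views between the two originals are
    linearly interpolated; views outside them are linearly extrapolated.
    """
    span = right_view - left_view
    positions = {}
    for view in range(1, num_views + 1):
        steps = view - left_view  # negative for views left of the left-eye image
        x = left[0] + steps * (right[0] - left[0]) / span
        y = left[1] + steps * (right[1] - left[1]) / span
        positions[view] = (x, y)
    return positions


if __name__ == "__main__":
    # A feature with a disparity of -6 between the two original views.
    for view, pos in feature_in_views((100.0, 50.0), (94.0, 50.0)).items():
        print(view, pos)
```

Under this model the disparity between any two adjacent views is one third of the disparity between the two input images, which is the equal-disparity property checked against the ground-truth images above.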
3.4 Texture Mapping

After triangulating the two images, each triangle can be stored as a texture. There are two sets of textures because the triangle mesh of the left-eye image differs from that of the right-eye image. Since the coordinates of all feature points in the virtual views are known from the previous step, we can warp the image by pasting each texture at its corresponding location. In our implementation, virtual views 1, 2, and 4 are synthesized from the left-eye image, which is view 3. Virtual views 5, 7, and 8 are synthesized from the right-eye image, which is view 6. That is, the four left views share one triangle mesh, while the four right views share the other.

4. Result

Figure 5 shows some of our eight-view results.

Figure 5. Five sets of eight-view images: (a) Balls, (b) Dino, (c) Dog, (d) Fighter, (e) Ship. The viewing angle is from left to right.

5. Conclusion

In this project, we present a system that can synthesize virtual views along the baseline of an input two-view stereo image pair. In contrast to the majority of view synthesis algorithms today, ours is a warp-based algorithm that warps the virtual view under the guidance of the feature correspondences between the input image pair. Therefore, it does not need any depth information, which is hard to compute accurately and time-consuming to compute. Since we apply a dense feature extraction algorithm and extract additional features on edges, the matched feature pairs are distributed quite uniformly over the image, with many of them on edges. With this method, the triangulation cuts out object contours well, so the results are less distorted than those of warp-based methods that use uniform quad grids.

At the time of this report, our system uses GPU acceleration only in the feature descriptor extraction and matching part. Further acceleration is possible, for example in the outlier filtering and triangulation parts. We leave these as future work since they are not as time-consuming as extraction and matching. The SURF descriptor we use is too expensive for our case, since the two input views do not differ much from each other. Possible alternatives are simplified descriptors or learning-based descriptors that take the input image pair as training data. For the triangulation part, we could use constrained Delaunay triangulation combined with LSD and a line matching algorithm in order to cut out object edges better, which may give better results.

References

[1] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.
[2] Canny, J. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, 1986.
[3] Grompone von Gioi, R., Jakubowicz, J., Morel, J.-M., and Randall, G. LSD: A Fast Line Segment Detector with a False Detection Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 4, pp. 722-732, April 2010.
[4] Tuytelaars, T. Dense Interest Points. In CVPR, 2010.