Surface Reconstruction from Intensity Image using Illumination Model based Morphable Modeling

Paper ID: 20

Abstract. We present a new method for reconstructing the depth of a known object from a single still image using the deformed underneath sign matrix of a similar object. Existing Shape from Shading (SFS) methods try to establish a relationship between the intensity values of a still image and the surface normals of the corresponding depth, but most of them resort to error-minimization approaches. Because these reconstruction approaches are fundamentally ill-posed, they have limited success on surfaces such as a human face. Photometric Stereo (PS) and Shape from Motion (SfM) based methods extend SFS by adding information or constraints about the target. Our goal is identical to that of SFS; however, we tackle the problem by building a relationship between the gradient of depth and the intensity value at the corresponding location of the same object. This formula is simplified and approximated to handle different materials and lighting conditions, and the underneath sign matrix is obtained by resizing and deforming each Region of Interest (ROI) with respect to its counterpart on a similar object. The target object is then reconstructed from its still image. In addition, fine details of the surface are rebuilt using a Gabor Wavelet Network (GWN) on the different ROIs. Finally, to merge the patches, a Self-Organizing Map (SOM) based method is used to retrieve and smooth the boundary parts of the ROIs. Compared with state-of-the-art SFS-based methods, the proposed method yields promising results on both widely used benchmark datasets and images in the wild.

Keywords: 3D surfaces, depth reconstruction, SFS, morphable modeling, surface deforming, human perception.

1 Introduction

We focus on the problem of reconstructing the depth of an object from a single intensity image.
The problem has direct relevance to many applications, such as medical imaging, enhanced face recognition, and 3D printing, and to the long-standing question of how humans perceive depth. Humans have a remarkable capability to perceive 3D shape by looking at a 2D monocular image. Enabling computer vision systems to do the same remains a challenging task. The exact problem, formulated as early as 1970 [7], of obtaining the shape of a smooth opaque object from a single view is called the shape-from-shading problem. Significant research has been done in this area over the past four decades with varying levels of success [12].

Fig. 1: The input data and generated data. From left to right: input raw sensor range data; input target image; output reconstructed 3D surface.

The classical SFS problem is typically solved under assumptions such as a single point light source, constant albedo, and Lambertian reflectance. Such methods perform well on some simple images, such as a vase, but have limited success on complex images, such as a bust [11]. The key challenge faced by these methods is that the problem is fundamentally ill-posed: for a given intensity there can be multiple valid surfaces [12, 13, 16, 17]. There have been recent advances in resolving this ambiguity. However, most methods either seek additional images corresponding to the target image (e.g., photometric stereo (PS) [18, 19], multiple images, or Structure from Motion [20, 22]) or require knowledge of the context, such as the illumination model [11]. Such methods start from a known reference 3D surface shape and then establish point correspondence between the reference shape and the input image. To reach an acceptable solution, constraints are applied.
Recently, the work by Barron and Malik [8, 10] has advanced intrinsic image modeling and SFS by simultaneously extracting multiple aspects of an image, including shape, albedo, and illumination, from a single intensity image. In this paper, we try to solve the same problem without using any additional context-based constraints and without knowing the depth of the target object. Instead, we propose a new method based on an illumination model and an object similar to the target. The illumination model establishes a relationship between the gradient at each point and its corresponding intensity value. The input raw data, the still image, and the generated data of our method are shown in Fig. 1.

1.1 Workflow

As the workflow in Fig. 2 shows, our approach has 5 steps. In step 1, we identify ROIs in both the reference depth and the target image in the following manner: first, we identify keypoints in the reference depth by finding local maxima/minima, where $\partial z/\partial x = 0$ and $\partial z/\partial y = 0$ (usually we select the region around a local maximum, while local minima determine the boundaries); next, we determine the corresponding ROIs in the target image manually (Section 2.2 introduces a semi-automated way of doing this). In step 2, the ROIs of the reference depth are resized to match the size of their counterparts in the target image, and from them we build a sign matrix (along the x or y axis, the values 1, 0, and -1 indicate that the depth is increasing, unchanged, or decreasing, respectively; e.g., we can decide the signs of the first row and then the signs of the columns below that row). In step 3, the depth of the target image is reconstructed. In step 4, a GWN-based method is used to recover the details of the target image. In the final step, a SOM-based method is used to retrieve and smooth the boundaries of the ROIs.

Fig. 2: Overview of Proposed Tasks. [Diagram: Target Image and Reference Depth; Step 1: Selection of Key-Points / Regions; Step 2: Matching of corresponding ROIs to obtain underneath sign matrix (Section 2.3); Step 3: Reconstruction using illumination model (Sections 2.1, 2.2); Step 4: GWN based details reconstruction (Section 3); Step 5: SOM based boundaries smoothing (Section 4); output surface.] The inputs to the reconstruction algorithm are the target image and a reference depth. The algorithm expects general correspondence between the target and reference images. Mathematically, the correspondence should be such that the depth-intensity relationship for the reference object is the same as that of the target object (we explain the relationship using a formula and a sign matrix in Section 2). For practical purposes, given the target object, we choose a closely similar reference object.

Fig. 3: Illumination model used in this paper. [Diagram labels: (A) line of sight, camera, luminous intensity $I = I_{max}\cos\theta$, normal $\hat{n}$, target surface; (B) line of sight, normals $\hat{n}$ and $\hat{n}_{max}$, surface points $x$ and $x_{max}$, angles $2\alpha$, $\theta$, $\alpha-\beta$, $\beta$, axes $\hat{x}$ and $\hat{z}$.] In figure (A), from the observer's or camera's perspective, the angle $\theta$ is either positive (clockwise) or negative (counter-clockwise), which gives us $|\theta| = \arccos(I_x / I_{max})$. At the bottom of figure (B) there is a Cartesian coordinate system, where $\partial z/\partial x = \tan(\alpha - \beta)$ is the surface gradient for normal $\hat{n}$ at surface point $x$, and $(\partial z/\partial x)_{max} = \tan(\alpha - \beta + \beta) = \tan(\alpha)$ is the surface gradient for normal $\hat{n}_{max}$ at surface point $x_{max}$. As shown in figure (B), at point $x_{max}$ the light is reflected along the line of sight, i.e., the positive $\hat{z}$ direction, and we assume the observer receives the maximum intensity $I_{max}$ there. Seen from the origin of the Cartesian coordinate system, normal $\hat{n}$ is normal $\hat{n}_{max}$ rotated counter-clockwise by the angle $\beta$. The key idea here is that we need to calculate $\theta$ using $\alpha$ and $\beta$.
Notice that the normal $\hat{n}_{max}$ bisects the incoming (incident) and outgoing (reflected) light rays at the point $x_{max}$; therefore, at the point $x$, the angle between normal $\hat{n}$ and the incoming light ray is $2\alpha - (\alpha - \beta) = \alpha + \beta$. That is to say, the angle between the incoming and outgoing light at point $x$ is $2(\alpha+\beta)$. So in the end, $\theta = 2(\alpha+\beta) - 2\alpha = 2\beta$.

2 3D Reconstruction Method

2.1 Basic Illumination Model

To make our work easier to follow, we describe the idea before giving any formula. Mathematically, in two-dimensional Euclidean space with orthogonal axes $z$ and $x$, as long as the gradient $\Delta z/\Delta x$ is known, the depth $z$ at any point can be integrated from a known point $x_{start}$. In other words, if we regard the space as discrete, and the step-wise $\Delta z$ (with respect to an equal-length step $\Delta x$) can be inferred or calculated at every point, then the summation of $\Delta z$ from the starting point $x_{start}$ to the ending point $x_{end}$ along the path of summation, i.e., $\sum \Delta z_i$, is the relative height $z_{end} - z_{start}$. Therefore, the problem in our case is to find the relationship between the partial gradient $\partial z/\partial x$ (or $\partial z/\partial y$) at a given point $(x, y)$ and the intensity value $I_{xy}$ at that point:

\[ \frac{\partial z}{\partial x} = f_x(I_{xy}) \quad \text{or} \quad \frac{\partial z}{\partial y} = f_y(I_{xy}) \tag{1} \]

Fig. 4: Tangent and Inverse Cosine Functions. [Two plots: tangent of argument in radians; inverse cosine in radians.]

This is much the same formula as stated in the traditional SFS problem: $I(x) = s(n(x))$, where $n(x)$ is the normal vector at location $x$. As shown in Fig. 3, the luminance, often called brightness, from the observer's perspective (the line of sight) can be represented as the luminous intensity per projected area normal to the line of observation [6] (notice that in Fig. 3(B) we assume a single light source located at infinity, so that light falls on each point of the surface from the same direction). To be more precise, the angle $\theta$ between the line of sight and the incoming light ray at point $x$ can be inferred from the angle between the normal at $x$, i.e., $\hat{n}$, and the normal at the point of maximum intensity received by the observer, i.e., $\hat{n}_{max}$, which gives us $\theta = 2\beta$, or $\beta = \frac{1}{2}\theta = \mp\frac{1}{2}\arccos(I_x / I_{max})$. We notice that the gradient at surface point $x$ is exactly $\tan(\alpha - \beta)$, or in other words, $\partial z/\partial x = \tan(\alpha - \beta)$. Putting the above formulas together, we have:

\[ \frac{\partial z}{\partial x} = \tan(\alpha - \beta) = \tan\left(\alpha \pm \frac{1}{2}\arccos\frac{I_x}{I_{max}}\right) \tag{2} \]

2.2 Handling Issues caused by Lighting Condition and Material

It would be straightforward to implement formula (2) directly, which leads to a 2-dimensional integration. However, the formula only works well in ideal conditions; for images in the wild, shading and especially the surface material play important roles in the surface reconstruction. Therefore, we introduce an approximation of the above formula, together with an offset factor, to counterbalance the non-ideal conditions.

We first consider the near-linear behavior of the arccos and tan functions. As Fig. 4 shows, most values of both the inverse cosine and the tangent function, within their domains of definition, can be approximated by a linear relation. Thus, we make the following approximation: $\tan(x) \approx k_1 x + b_1$ and $\arccos(x) \approx k_2 x + b_2$ (from Fig. 4, we can see that $k_1$, $k_2$, $b_1$ and $b_2$ are constants). Instead of the continuous form, we use the discrete form of the partial derivative for the purpose of digital computation. Formula (2) can therefore be rewritten as:

\[ \frac{\Delta z}{\Delta x} = k_1\left(\alpha \pm \frac{1}{2}\left(k_2\frac{I}{I_{max}} + b_2\right)\right) + b_1 \tag{3} \]

As discussed, lighting conditions and different materials impose an offset on the intensity values received by the observer or camera.
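The relationship in formulas (2) and (3), together with the summation idea of Section 2.1, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice $\alpha = 0$, the sign list, and the sample intensities are all hypothetical, and the sign in front of the arccos term is assumed to be known at every point.

```python
import math
from itertools import accumulate


def gradient_exact(intensity, i_max, alpha, sign):
    """Formula (2): dz/dx = tan(alpha + sign * 0.5 * arccos(I / I_max)).

    `sign` is +1 or -1 and is assumed known (hypothetical input).
    """
    ratio = max(-1.0, min(1.0, intensity / i_max))  # keep arccos in its domain
    return math.tan(alpha + sign * 0.5 * math.acos(ratio))


def gradient_linear(intensity, i_max, alpha, sign, k1, b1, k2, b2):
    """Formula (3): tan and arccos replaced by linear fits k*x + b."""
    return k1 * (alpha + sign * 0.5 * (k2 * intensity / i_max + b2)) + b1


def relative_depth(gradients, dx=1.0):
    """Running sum of step-wise depth changes, starting from x_start."""
    return list(accumulate(g * dx for g in gradients))


# Hypothetical slice of intensities with an assumed sign for each sample.
row = [255.0, 200.0, 128.0, 200.0, 255.0]
signs = [+1, +1, +1, -1, -1]
grads = [gradient_exact(i, 255.0, 0.0, s) for i, s in zip(row, signs)]
depth = relative_depth(grads)
```

With $\alpha = 0$, a pixel at maximum intensity gives a zero gradient, and darker pixels give steeper slopes whose direction depends entirely on the assumed sign. This is the ambiguity the sign matrix is meant to resolve.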
We empirically assume this offset has a constant value $b_{offset}$. Moreover, the choice of this value is affected by the color model. Conversion of a color image to grayscale is not a unique process: the weights on the different channels affect the output grayscale. Usually we prefer to use the principles of photometry to match the luminance of the grayscale image to the luminance of the original color image (in practice, the RGB model uses a different weight combination than the YUV model). Since this is not the topic of this paper, we stop here and refer the reader to [14]. Simplifying the above formula, we have:

\[ \frac{\Delta z}{\Delta x} = \pm(k \cdot I + b) + b_0 + b_{offset} = \pm(k \cdot I + b) + \tilde{b} \tag{4} \]

where, comparing with (3), $k = k_1 k_2 / (2 I_{max})$, $b = k_1 b_2 / 2$, and $b_0 = k_1\alpha + b_1$. Because we do not know the sign in front of the quantity $\frac{1}{2}\arccos(I_x / I_{max})$ in (2), we cannot find the exact expression for the gradient. Let $\rho(x) \in \{-1, +1\}$ be a binary indicator function which indicates the direction of the gradient. If $\rho(x)$ is known, we get:

\[ \frac{\Delta z}{\Delta x} = \rho(x)(k \cdot I + b) + \tilde{b} \tag{5} \]

Using formula (5), we can determine candidate local maximum/minimum points in the target image by setting $\Delta z/\Delta x = 0$ and $\Delta z/\Delta y = 0$, and then manually prune unnecessary points. Given the gradient in (5), we can reconstruct the relative depth between $x_{start}$ and $x_{end}$ as:

\[ z = \sum_{x_{start}}^{x_{end}} \left[\rho(x)(k \cdot I_x + b) + \tilde{b}\right] \Delta x \tag{6} \]

We ran an experiment using formula (6) to demonstrate our ideas. As shown in Fig. 5, we assume the colors of the sky (dark blue), the pyramid (light yellow), and the face (misty rose), and empirically select $b_{offset}$ for each of them (in our case, these values are -112, -201, and -212 respectively, represented as signed 16-bit integers). Given the same sign function in the bottom row, the step-wise $\Delta z$ is calculated (third row), and the summation (fourth row) shows that the reconstructed depth is correct. In this experiment, the calculated $\Delta z$ for the sky is always 0, so no matter what the sign function is, the depth sums to 0. For the pyramid and Mozart's face, however, $\Delta z$ takes on different values, and without the alignment of the sign function below, the depth could not be reconstructed correctly. Therefore, in our method, the purpose of resizing a similar object's ROIs is to obtain the correct alignment of the sign matrix/function.

Fig. 5: An experiment to prove the approximated illumination model. [Panel titles: Sampling from Sky; Sampling from Mozart; Sampling from Pyramid. Sign axis ticks: +1, 0, -1.] The first row: input images; the second row: a slice of intensity values sampled from the images above; the third row: recovered absolute value of $\Delta z$ using formula (5); the fourth row: reconstructed depth; the fifth row: underneath sign vector. Notice that all 3 experiments use the same sign vector.

Fig. 6: Reconstructed surface using sign matrix from a deformed hemisphere. [Panel titles: Hemisphere; Deformed Hemisphere; Reconstructed Surface from Image; Ground Truth.] From left to right: original hemisphere; deformed hemisphere; reconstructed depth from the still image in its upper-right corner (within the pink circle); ground truth of the depth in the pink circle. Notice that the summation sequence starts from the centerline and then proceeds to both sides.

2.3 Reconstruction using Sign Matrix

Finally, we consider double summation over more general regions. Suppose that the region $R$ is defined by $G_1(x) \le y \le G_2(x)$ with $a \le x \le b$. This is called a vertically simple region.
The double summation is given by

\[ z = \sum_{x_{start}}^{x_{end}} \sum_{G_1(x)}^{G_2(x)} \left[\rho_{xy}(k_x \cdot I_{xy} + b_x) + \tilde{b}_x\right]\left[\rho_{xy}(k_y \cdot I_{xy} + b_y) + \tilde{b}_y\right] \Delta x \Delta y \]
\[ \phantom{z} = \sum_{x_{start}}^{x_{end}} \left[ \left[\rho_{xy}(k_x \cdot I_{xy} + b_x) + \tilde{b}_x\right] \Delta x \sum_{G_1(x)}^{G_2(x)} \left[\rho_{xy}(k_y \cdot I_{xy} + b_y) + \tilde{b}_y\right] \Delta y \right] \tag{7} \]

As formula (7) shows, recovering the shape of the object is still determined by two important factors: the underneath sign matrix and the intensity values. The offset plays an important role here too. This leads to an interesting point: the sign matrix $\rho_{xy}$ can be obtained easily by deforming a similar object's surface, i.e., when an existing object is deformed, not only does the depth of the morphable object change at each location, but the underneath sign matrix changes as well.

We now describe another experiment to test our idea. Take Fig. 6 as an example. Here we have a hemisphere, and our goal is to reconstruct the target object from its still image. The first step is to estimate the underneath sign matrix. As can be seen in Fig. 6, by deforming the surface of the hemisphere, the sign matrix obtained is exactly the same as that of the ground truth. Next, using formula (7), we are able to recover the surface from the intensity values within the pink circle. The result is quite similar to the shape of the ground truth, which supports the proposed idea and the approximation.

3 Reconstruction of Surface Details

Since the deformed shape keeps the original features of the reference object, the details of the target object need to be reconstructed too. In terms of imposing details, traditional SFS can perform well. Here we adopt a strategy using a GWN, which keeps the details of the target image without disturbing the rough surface.
The proposed method takes all ROIs as a whole and minimizes the errors in a batch manner:

\[ [s, \theta, w] = \arg\min_{s,\theta,w} \sum_{i=1}^{N} \left( \Big\| I_i - \sum_{j=1}^{K_i} w_{ij}\psi_{ij} \Big\|_2^2 + \beta \sum_{j=1}^{K_i} |w_{ij}| \right) \tag{8} \]

Here $\theta$ and $s$ are the orientation factor and scale factor of the Gabor wavelets, and $w_{ij}$ and $\psi_{ij}$ are the $j$-th coefficient and its corresponding wavelet on the $i$-th ROI, respectively. In order to prevent over-fitting, we add the regularization term $\beta \sum_{j=1}^{K_i} |w_{ij}|$, where $\beta$ is the penalty factor for the L1 norm of the vector $[w_{i1}, ..., w_{iK_i}]$.

4 Self-Organizing Maps

Before the different ROIs are merged, their boundaries are usually rough and a smoothing process is required. Instead of devising a smoothing strategy, we propose a depth retrieval method that uses existing surface boundary parts. This issue has been addressed by an interesting recent paper [23], where the input depth is divided into five facial parts via alignment, and each facial part is matched independently to the dataset, resulting in five high-resolution meshes. They use the azimuth angle and elevation angle to measure the similarity between two patches. Our method makes the stored depth "learn" the target boundaries, so that the best match is gradually smoothed by learning the two boundaries. The depth patches come from public datasets [2, 4].

Traditionally, there are two operational modes for a SOM: training and mapping. During training, the learning example is compared to the weight vectors associated with each neuron, and the closest (winning) neuron is selected. The weights of all the neurons are then updated using the following update equation:

\[ \omega_k(t+1) = \omega_k(t) + \alpha(t)\,\eta(\nu, k, t)\,\|\omega_k(t) - x\|_2 \tag{9} \]

Here $\omega_k(t)$ is the weight of the $k$-th neuron at the $t$-th iteration, $x$ is the input vector, and $\nu$ is the index of the winning neuron. $\alpha(t)$ is the learning rate, which decreases monotonically with $t$. $\eta$ is a neighborhood function which measures the distance between a given neuron and the winning neuron.
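A single SOM training step of this kind can be sketched in Python as follows. This is a hedged illustration rather than the paper's code: the function name `som_step` is hypothetical, the neighborhood is assumed to be Gaussian in the grid distance, and the weights are moved along the direction $x - \omega_k(t)$ as in the standard Kohonen rule, which is a substitution for the norm term printed in (9).

```python
import math


def som_step(grid, x, lr, sigma):
    """One SOM update on an a-by-b grid of weight vectors (nested lists).

    Assumptions (not from the paper): Gaussian neighborhood in grid
    distance, and the standard Kohonen update direction (x - w).
    Returns the updated grid and the (row, col) index of the winner.
    """
    a, b = len(grid), len(grid[0])

    def sq_dist(w):
        # Squared Euclidean distance between a weight vector and the input.
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

    # Winning neuron: the one whose weight vector is closest to x.
    winner = min(
        ((i, j) for i in range(a) for j in range(b)),
        key=lambda ij: sq_dist(grid[ij[0]][ij[1]]),
    )

    new_grid = []
    for i in range(a):
        row = []
        for j in range(b):
            # Squared grid distance from this neuron to the winner.
            grid_sq = (i - winner[0]) ** 2 + (j - winner[1]) ** 2
            eta = math.exp(-grid_sq / (2.0 * sigma ** 2))
            row.append([wi + lr * eta * (xi - wi) for wi, xi in zip(grid[i][j], x)])
        new_grid.append(row)
    return new_grid, winner


# Hypothetical 2x2 grid of 2-dimensional weights and one input vector.
grid = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
updated, winner = som_step(grid, [1.0, 1.0], lr=0.5, sigma=1.0)
```

Neurons close to the winner on the grid move most strongly toward the input, and decreasing `lr` and `sigma` over the rounds would correspond to the monotonically decreasing $\alpha(t)$ and neighborhood width.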
Typically, $\eta$ takes a Gaussian form, $\eta(\nu, k, t) = \exp\left(-\frac{\Delta(\nu, k)^2}{2\sigma(t)^2}\right)$, where $\Delta(\cdot,\cdot)$ is the distance between two neurons on the grid, and $\sigma$ is the monotonically decreasing neighborhood width.

The SOM algorithm assumes that the input vectors are semantically homogeneous. In our case, we attach the stored depth map of boundary parts to each neuron. During training, in each round, the errors between two adjacent ROIs with respect to the boundary part are calculated, and the winning neuron should have the smallest error. We summarize the idea in Algorithm 1.

Algorithm 1: Parallel SOM Algorithm
Input : Adjacent patch R, adjacent patch R̄, number of rounds n.
Output: Patch N_{î,ĵ}.
  Initialize 2-dimensional matrix N of size a × b with stored depth of the same type of patches;
  Initialize training set S = {R, R̄};
  for c ← 1 to n do
    for k ← 1 to 2 do
      Find winning neuron ν = N_{î,ĵ} for S_k using formula (9);
      for i ← 1 to a do
        for j ← 1 to b do
          Update N_{ij} w.r.t. ν and S_k;
        end
      end
    end
    Find final winning neuron N_{î,ĵ};
  end

5 Experiments

To demonstrate the robustness of our method, we test it on both benchmark data and images in the wild.

5.1 Benchmark Datasets

The first set of evaluations was conducted on public datasets of RGB and depth images of objects [1, 3, 5]. In Fig. 7, we compare our method with enhanced SFS in terms of depth errors. For all three benchmark objects, our method achieves a better reconstruction result than enhanced SFS on average. This comes mainly from the fact that our result is calculated using an integration/summation process, which leads to a fairly accurate output as a whole. In contrast, traditional SFS-based methods focus on local ambiguity; even in a natural lighting environment, the reconstructed surface converges to the gray-scale or intensity values. Some SFS-based methods will inevitably converge to a global minimum/maximum if their models are fundamentally convex or concave. Take the MPI vase as an example (the first row in Fig. 7): enhanced SFS reconstructs the boundary part successfully and performs better there than our method; however, for the bulge of the vase, enhanced SFS simply did not recover the depth, and the error compared with the ground truth amounts to around 14cm to 15cm.

Fig. 7: A comparison in terms of depth errors between our method and enhanced SFS (best viewed in color). [Columns: Ground Truth; Our Result; Enhanced SFS; Color-coded Depth Error for Our Result and Enhanced SFS. Color-bar ticks: 11cm, 7cm, 3cm; 2,500µm, 1,500µm, 500µm; 2,000µm, 1,250µm, 500µm.]

The next set of experiments is performed on the benchmark [5] to compare normal errors among traditional SFS, enhanced SFS, and our method. We maintain lighting conditions similar to those of [11] (see the leftmost figure of Mozart in the first row of Fig. 8). It can be seen that the traditional SFS-based method converges to local intensity values, which gives the effect of a "deep trench", while enhanced SFS overcomes the problem by adding natural illumination constraints. In our case, however, the depth information comes from the accumulation of a portion of the intensity, and therefore the rough normal error is minimized. Moreover, using Gabor wavelets ensures that the mean value and covariance are allocated along the direction that minimizes the reconstruction error. This gives our method advantages over both traditional and enhanced SFS methods.

We then numerically compare our method to three state-of-the-art methods, traditional SFS [15], PS, and the recent SAIFS method [10], on the Stanford benchmark [3]. The results are shown in Table 1. The proposed method outperforms both SFS and SAIFS on all three benchmark images, by a factor of more than 2 over SFS and more than 4 over SAIFS. Its average error is even lower than that of PS, which suggests that, although our method may be less accurate in heavily shaded regions, it reconstructs the depth effectively in other areas.
Fig. 8: A comparison in terms of normal errors among our method, traditional SFS and enhanced SFS (best viewed in color). [Columns: Ground Truth; SFS [Tsai and Shah, 1994]; Enhanced SFS [Johnson and Adelson, 2011]; Ours. Color-bar ticks: 80°, 50°, 20°.] The first row: target image (leftmost column) and reconstructed surface; the second row: normal map of target surface and reconstructed surface; the third row: normal error of reconstructed surface.

Model    Dragon    Armadillo    Buddha
SFS       962.4      1067.4     1251.7
SAIFS    1915.6      2217.1     2405.2
PS        492.7       515.3      603.1
Ours      417.1       497.5      542.9

Table 1: Comparison of average reconstruction error of proposed method and existing methods. Error is measured in µm.

5.2 Images in the Wild

We especially wish to see how our method handles shading and natural lighting conditions, as well as the problems caused by different materials. We select images of famous people from the internet. In Fig. 9, for example, the eyebrows and mustache hair take on a different "color" than regular skin. Moreover, the lighting is natural, so our assumption of a single light source does not hold either. The results are shown in Fig. 9. They are especially interesting in the sense that, as long as the corresponding underneath sign matrix is similar enough to its counterpart on the target, reconstructing a satisfactory surface is possible.

Fig. 9: Result of reconstruction for images in the wild.

6 Conclusions and Future Work

We have presented a method that recovers the depth of a given object from a still image by deforming the underneath sign matrix of a similar object. The algorithm handles reflectance problems caused by different materials or lighting conditions very well by applying an approximated formula of the proposed illumination model, which is a major contribution of our work.
In terms of recovery, unlike PS-based methods, our method can effectively reconstruct a complex surface such as a face even though very little is known about the depth of the target. For each ROI on the target image, the details of the surface are recovered using a GWN. To merge the different ROIs, a SOM-based method is used to retrieve and smooth the boundary parts of the ROIs. The current ROIs are manually selected according to the number of local maximum points, so in the future we would like to explore an automatic way of finding regions.

References

1. MPI-Inf 3d data.
2. VAP dataset.
3. The Stanford 3d scanning repository, 3Dscanrep/
4. Thingiverse.
5. UCF shape database.
6. RCA: RCA Electro-Optics Handbook, pp. 18–19. RCA/Commercial Engineering, Harrison, N.J. (1974).
7. Horn, B. K.: Shape from shading. Doctoral Thesis, Cambridge, MA, USA (1970).
8. Barron, J. T., Malik, J.: High-frequency shape and albedo from shading using natural image statistics. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2521–2528 (2011).
9. Horn, B. K., Brooks, M. J.: The variational approach to shape from shading. Comput. Vision Graph. Image Process., vol. 33, pp. 174–208 (1986).
10. Barron, J. T., Malik, J.: Shape, albedo, and illumination from a single image of an unknown object. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 334–341 (2012).
11. Johnson, M. K., Adelson, E. H.: Shape estimation in natural illumination. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2553–2560 (2011).
12. Zhang, R., Tsai, P.-S., Cryer, J. E., Shah, M.: Shape from shading: A survey. IEEE Trans. PAMI, vol. 21, pp. 690–706 (1999).
13. Lee, K. M., Kuo, C. C. J.: Shape from shading with a linear triangular element surface model. IEEE Trans. PAMI, vol. 15, pp. 815–822 (1993).
14. Volz, H. G.: Industrial Color Testing: Fundamentals and Techniques. Wiley-VCH, ISBN 3-527-30436-3 (2001).
15. Tsai, P.-S., Shah, M.: Shape from shading using linear approximation. Image and Vision Computing, vol. 12, pp. 487–498 (1994).
16. Kozera, R.: Uniqueness in shape from shading revisited. J. Math. Imaging Vis., vol. 7, pp. 123–138 (1997).
17. Oliensis, J.: Existence and uniqueness in shape from shading. In: Proceedings of the 10th International Conference on Pattern Recognition, vol. 1, pp. 341–345 (1990).
18. Woodham, R. J.: Photometric method for determining surface orientation from multiple images. Optical Engineering, vol. 19, pp. 139–144 (1980).
19. Wu, C. L., Wilburn, B., Matsushita, Y., Theobalt, C.: High-quality shape from multi-view stereo and shading under general illumination. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pp. 969–976 (2011).
20. Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3d shape from image streams. In: CVPR, vol. 2, pp. 690–696. IEEE Computer Society (2000).
21. Rusinkiewicz, S., Hall-Holt, O., Levoy, M.: Real-time 3d model acquisition. ACM Trans. Graph., vol. 21, pp. 438–446 (2002).
22. Garg, R., Roussos, A., de Agapito, L.: Dense variational reconstruction of non-rigid surfaces from monocular video. In: IEEE CVPR, pp. 1272–1279 (2013).
23. Liang, S., Kemelmacher-Shlizerman, I., Shapiro, L. G.: 3D Face Hallucination from a Single Depth Frame. In: International Conf. on 3D Vision (3DV), Tokyo (2014).