FACIAL FEATURE PARSING AND LANDMARK DETECTION VIA LOW-RANK MATRIX DECOMPOSITION

Rui Guo and Hairong Qi
University of Tennessee, Knoxville
Department of Electrical Engineering and Computer Science
Knoxville, TN 37996, USA
ABSTRACT
Facial feature parsing is an active research topic in image understanding. In this paper, we address the parsing problem by applying low-rank matrix decomposition to facial images. Specifically, the face is modeled as a combination of the skin background, which resides in a low-dimensional subspace, and the salient feature components (e.g., eyes, nose and mouth), which act as sparse noise. Given a face image, feature parsing can then be naturally formulated as sparse noise detection during the recovery of a low-rank matrix. To enhance parsing, a linear transformation matrix is learned to boost discriminative feature extraction. Furthermore, with the derived parsing maps, the algorithm is easily extended to the facial landmark detection task. The effectiveness of the proposed algorithm is evaluated through several experiments on comprehensive datasets.
Index Terms— Facial feature parsing, Low-rank matrix
decomposition, Landmark detection
1. INTRODUCTION
In image understanding, facial feature parsing refers to the
task of segmenting face images into different facial feature
components, e.g., eyes, nose and mouth. The study of facial parsing is attractive because of the important role it plays in multiple applications, including human identity recognition, animation, demographic analysis [1], facial image synthesis [2] and face image sketching [3]. All these applications call for accurate segmentation that is robust to expression, pose and illumination variations. Most existing works accomplish this by localizing landmarks on the face as initial points and then refining them on a pixel-by-pixel basis using classification or regression until the regions of interest are completely segmented. To boost performance, some template matching models [4] and graphical models [5] are assumed to be known a priori.
In this paper, we address the parsing problem from a new
perspective and focus on facial feature detection from the face
image instead of assigning a label to each pixel. Compared to previous methods, this detection-based approach is more efficient since it does not need to train component descriptors on a pixel-by-pixel basis. The facial features are treated as an entire set and can be detected in one pass. In addition, we emphasize that the face alignment required by most existing approaches is not necessary.
Compared to the skin background, the facial features contain discriminative shape and texture information, making them salient on the face. The intuitive idea for parsing is then to separate the salient components from the background. Our algorithm employs the low-rank matrix decomposition method [6], which treats the skin background as a matrix spanning a low-dimensional subspace and the facial features, with their discriminative characteristics, as sparse noise.
Based on the parsing results, we also extend the work to landmark detection. In an extensive evaluation, the proposed detection-based parsing algorithm compares favorably with state-of-the-art techniques. In summary, this work makes contributions in the following aspects:
• This is the first time that low-rank matrix decomposition is introduced to solve the facial feature parsing problem; the proposed detection-based algorithm is original work in this area;
• Compared to the existing parsing methods, our algorithm considers the facial features as a whole and detects them simultaneously in one decomposition process with high accuracy;
• With the parsing results, we can easily extend the work to accomplish facial landmark detection. The high parsing accuracy ensures that the detection results are competitive with the state of the art.
2. RELATED WORK
In general, facial parsing is implemented in two ways: landmark-based approaches and segmentation-based approaches [7]. On one hand, the landmark-based method marks each pixel on the face with a semantic part label based on pixel attributes, and its performance relies heavily on the initialization process. Improved versions of this approach include the Markov Shape Model [8], which considers local line segments and appearance to alleviate the dependency on initialization. However, other problems remain, since such techniques require expensive computation and tend to fail in real-world applications. On the other hand, segmentation-based approaches were proposed with computational efficiency and robustness to pose and expression variations in mind. In [4], a component-based discriminative search algorithm was designed, in which multiple facial component detectors were combined to detect the facial features. In more recent research, since facial features have unique geometric configurations and appearances, sparse matching and deep networks were adopted to detect and correlate the facial features and identify the components to which they belong [9]. However, these methods still have drawbacks. First, it is difficult to give facial features a consistent model; for example, there is no clear shape model that describes the mouth under different expressions, and pose variation may also cause appearance changes in different scenarios. Once model generation fails, the matching process cannot locate the correct components on the face. Second, it is difficult for a single detector to precisely locate all facial components, so we either need to incorporate more detectors or the parsing process fails.
Inspired by the saliency detection work in [10, 11], we model facial parsing as salient feature detection on the face. The motivation is that the facial features, e.g., eyes, nose and mouth, have unique appearances that make them visually salient compared with the skin texture. After applying low-rank matrix decomposition to the feature matrix, the facial feature components, acting as the sparse noise, are readily separated from the recovered skin background. The algorithm is detection-based and needs no high-level prior knowledge or models. This means that if facial components disappear due to occlusion, the algorithm cannot predict the positions of those features.
3. PARSING ALGORITHM
Before we introduce the parsing algorithm in detail, an overview of low-rank matrix decomposition is given in Section 3.1. After that, we describe the proposed feature representation in Section 3.2 and the learning of the transformation matrix in Section 3.3. The landmark detection process is discussed in Section 3.4.
3.1. Low-rank Matrix Representation
Instead of using the spatial information of the image, the face is represented in feature space and denoted as F. We model the facial image as a combination of a low-rank skin background L and facial features acting as sparse noise S. The prototype of matrix decomposition by low-rank representation (LRR) [6] is formulated as
(L^*, S^*) = \arg\min_{L,S} \big( \operatorname{rank}(L) + \lambda \|S\|_0 \big), \quad \text{s.t. } F = L + S \qquad (1)
where L^* and S^* are the optimized estimates of the skin background and the noise residual, respectively, and \|\cdot\|_0 denotes the L0 norm.
Since solving Eq. (1) is NP-hard, it is replaced by its convex surrogate
(L^*, S^*) = \arg\min_{L,S} \big( \|L\|_* + \lambda \|S\|_1 \big), \quad \text{s.t. } F = L + S \qquad (2)
where \|\cdot\|_* denotes the nuclear norm and \|\cdot\|_1 the L1 norm. Solvers of the LRR problem have been proposed in many research articles; we adopt the widely used Robust PCA method [6], which is extendable and flexible in many cases. We are interested in the decomposition result S^*, which contains the extracted parsing map in gray scale: the highlighted pixels in S^* correspond to the segmented facial features.
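For concreteness, the following is a minimal NumPy sketch of solving Eq. (2) with an inexact augmented-Lagrange-multiplier scheme, a standard Robust PCA solver; the default weight λ = 1/√max(D, N), the penalty schedule and the stopping tolerance are common but illustrative choices rather than the exact settings of [6].

import numpy as np

def robust_pca(F, lam=None, tol=1e-7, max_iter=500):
    """Decompose F into a low-rank part L and a sparse part S with F = L + S.

    An inexact-ALM sketch of the convex program in Eq. (2); lam defaults to
    1 / sqrt(max(D, N)), a commonly recommended value.
    """
    D, N = F.shape
    lam = lam or 1.0 / np.sqrt(max(D, N))
    norm_F = np.linalg.norm(F)
    spectral = np.linalg.norm(F, 2)
    Y = F / max(spectral, np.abs(F).max() / lam)   # scaled dual variable
    mu, rho = 1.25 / spectral, 1.5                 # penalty and its growth rate
    S = np.zeros_like(F)
    for _ in range(max_iter):
        # Singular value thresholding: update the low-rank component L.
        U, sig, Vt = np.linalg.svd(F - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Soft thresholding: update the sparse component S.
        R = F - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual ascent step and convergence check on the residual.
        Y = Y + mu * (F - L - S)
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(F - L - S) / norm_F < tol:
            break
    return L, S

The column magnitudes of the returned S can then be mapped back onto the cell grid of Section 3.2 to form the gray-scale parsing map.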
3.2. Facial Image Representation
Given a face image, we first apply the cascade face detector [12] to locate the face region. The detected face region is resized to a uniform size of 256 × 256. To acquire a comprehensive representation of the face, similar to [10], we extract the following multi-modality visual features of the face region (a brief extraction sketch is given after the list):
• Color features in the HSV color space. The hue, saturation and brightness values of the image are used to represent the color information, resulting in 3 color feature maps;
• Texture steerable pyramids [13]. The steerable pyramid filter is adopted to extract texture information; we use the filter responses at 4 orientations and 3 scales, resulting in 12 pyramid maps;
• Gabor wavelet features. Gabor filtering is also performed to extract more detailed texture features; we use Gabor wavelet filters at 6 orientations and 3 scales, yielding a total of 18 wavelet feature maps.
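As an illustration, the color and Gabor maps above could be computed roughly as follows with OpenCV and NumPy; the Gabor kernel size and wavelengths are assumptions, and the 12 steerable pyramid responses, which would come from a dedicated pyramid toolbox, are omitted from this sketch.

import cv2
import numpy as np

def extract_feature_maps(face_bgr):
    """Return per-pixel feature maps for a 256 x 256 face crop (sketch only)."""
    maps = []
    # 3 color maps: hue, saturation and value channels of the HSV image.
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    maps.extend([hsv[..., 0], hsv[..., 1], hsv[..., 2]])
    # 18 Gabor maps: 6 orientations x 3 scales (kernel size and wavelengths assumed).
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    for lambd in (4.0, 8.0, 16.0):                  # 3 scales
        for theta in np.arange(6) * np.pi / 6:      # 6 orientations
            kern = cv2.getGaborKernel((31, 31), 0.56 * lambd, theta, lambd, 0.5, 0)
            maps.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, kern)))
    # The 12 steerable pyramid maps (4 orientations x 3 scales) would be
    # appended here using a pyramid implementation such as [13].
    return maps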
All of these feature maps are properly normalized to reduce cross-feature variations and are stacked to form the feature representation. In the image space, we divide the face region into non-overlapping cells of 4 × 4 pixels. For each cell, the mean value of each of the 33 feature maps is computed, and the 33 mean values are concatenated into a 33 × 1 feature vector f_i, i = 1, ..., 4096, representing the cell. All the feature vectors of the face image form the matrix F = [f_1, f_2, ..., f_N] ∈ R^{D×N}, where D is the dimension of the feature vector (33) and N is the number of cells (4096 in our case). The matrix F is the facial image representation in the feature space.
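Assuming the cells are 4 × 4-pixel blocks tiling the 256 × 256 crop, which is what yields N = 4096, the matrix F can be assembled as in the following sketch; the per-map min–max scaling is one plausible instance of the normalization mentioned above.

import numpy as np

def build_feature_matrix(feature_maps, cell=4):
    """Stack per-pixel maps into F (D x N) by averaging over each cell.

    feature_maps: list of D arrays, each 256 x 256.
    Returns F with D rows (33 in our case) and N = (256 / cell)^2 columns.
    """
    stack = np.stack(feature_maps, axis=0)               # (D, 256, 256)
    D, H, W = stack.shape
    # Normalize each map to [0, 1] to reduce cross-feature variation.
    mins = stack.reshape(D, -1).min(axis=1)[:, None, None]
    maxs = stack.reshape(D, -1).max(axis=1)[:, None, None]
    stack = (stack - mins) / (maxs - mins + 1e-8)
    # Average-pool each map over non-overlapping cell x cell blocks.
    pooled = stack.reshape(D, H // cell, cell, W // cell, cell).mean(axis=(2, 4))
    # Column i of the result is the D-dimensional descriptor f_i of cell i.
    return pooled.reshape(D, -1)                          # (D, N)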
3.3. Learning Process of Linear Transformation Matrix
The matrix of the facial skin background is naturally low-rank in a face image. To boost this characteristic, a linear transformation matrix T is employed, which is learned in the feature space. For each training image, the feature representation is extracted according to Sec. 3.2. A diagonal matrix M = diag(m_1, m_2, ..., m_N) is generated from the ground truth to indicate whether each feature vector belongs to the skin background (value 1) or not (value 0). With this configuration, the transformation matrix T is learned by solving the optimization problem
T^* = \arg\min_{T} \Big( \frac{1}{K} \sum_{k=1}^{K} \| T F_k M_k \|_* - \gamma \| T \|_* \Big), \quad \text{s.t. } \| T \|_2 = c \qquad (3)
where K is the total number of training images and k is the image index. To avoid a trivial solution, we add a constraint that keeps the L2 norm of T at a small constant (c = 2 in our experiments). To solve Eq. (3), we adopt the gradient descent method [10, 14]. The role of the indicator matrix M is to zero out all feature vectors that do not belong to the skin background; therefore, the term T F M forces T to learn the features of the background while keeping the product in low-rank form. The learned T^* is used for all testing images. In the testing stage, we multiply T^* with the facial image representation F and derive the parsing map by applying Eq. (2) to T^* F.
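A rough sketch of such a projected subgradient descent is given below; it uses the standard U V^T subgradient of the nuclear norm, and the learning rate, iteration count, identity initialization and the use of the Frobenius norm for the ||T||_2 = c projection are illustrative choices.

import numpy as np

def nuclear_subgrad(X):
    """Subgradient of the nuclear norm ||X||_* at X, i.e. U @ Vt from its SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def learn_transform(F_list, M_list, gamma=0.1, c=2.0, lr=1e-3, n_iter=200):
    """Projected subgradient descent for the objective in Eq. (3).

    F_list: per-image feature matrices F_k (D x N).
    M_list: per-image diagonal skin indicators M_k (N x N), 1 = skin cell.
    """
    D = F_list[0].shape[0]
    K = len(F_list)
    T = np.eye(D)                      # identity initialization (assumption)
    for _ in range(n_iter):
        # Subgradient of (1/K) * sum_k ||T F_k M_k||_* with respect to T.
        grad = sum(nuclear_subgrad(T @ Fk @ Mk) @ (Fk @ Mk).T
                   for Fk, Mk in zip(F_list, M_list)) / K
        # Subtract the gradient of the gamma * ||T||_* term being maximized.
        grad -= gamma * nuclear_subgrad(T)
        T -= lr * grad
        # Project back onto the norm constraint (Frobenius norm assumed).
        T *= c / np.linalg.norm(T)
    return T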
3.4. Post-processing and Landmark Detection
With the parsing map, our algorithm can be easily extended to
detect landmark points on the face. In our work, we consider five landmarks: left eye center (LE), right eye center (RE), nose tip (N), left mouth corner (LM) and right mouth corner (RM). This setting coincides with [15] and is the one most commonly used for landmark detection.
The landmark points are extracted from the parsing map. We first apply a Gaussian map [16] to the parsing result to enhance the detected features. The kNN algorithm is then used to separate the feature components. To acquire a clear boundary for each feature component, ellipse matching [17] is applied to each class. The eye center points are taken as the centers of the eye classes, and likewise for the nose tip; the mouth corners are boundary points computed from the mouth class.
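For illustration, a simplified sketch of this post-processing is shown below; it substitutes k-means clustering (via scikit-learn) for the kNN separation, omits the Gaussian weighting and ellipse matching of [16, 17], and its threshold, cluster count and cluster-ordering heuristics are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def extract_landmarks(parsing_map, thresh=0.5):
    """Rough five-point landmark extraction from an (H x W) parsing map in [0, 1]."""
    ys, xs = np.nonzero(parsing_map > thresh)
    pts = np.column_stack([xs, ys]).astype(float)
    # Group the highlighted pixels into four components: two eyes, nose, mouth.
    labels = KMeans(n_clusters=4, n_init=10).fit_predict(pts)
    clusters = [pts[labels == i] for i in range(4)]
    # Order clusters top-to-bottom; the two highest are assumed to be the eyes,
    # then the nose, then the mouth.
    clusters.sort(key=lambda c: c[:, 1].mean())
    eyes = sorted(clusters[:2], key=lambda c: c[:, 0].mean())
    nose, mouth = clusters[2], clusters[3]
    return {
        "LE": eyes[0].mean(axis=0), "RE": eyes[1].mean(axis=0),
        "N": nose.mean(axis=0),
        "LM": mouth[mouth[:, 0].argmin()],   # leftmost mouth pixel
        "RM": mouth[mouth[:, 0].argmax()],   # rightmost mouth pixel
    }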
Fig. 1. Hand-labeled points on a FACES image and the generated bounding rectangles used for training.
4. EXPERIMENTS
We conduct two experiments on two challenging datasets
FACES [1] and LFPW [18] to evaluate the effectiveness of
the proposed algorithm. Both datasets come with manually
labeled landmarks. We perform the transformation matrix learning on FACES with 1000 randomly selected images. Each FACES image comes with 80 hand-labeled landmarks over the entire face. We select the relevant points to form bounding rectangles that cover the eye, nose and mouth regions, as shown in Fig. 1. Based on these rectangles, the indicator matrix M is established according to the discussion in Sec. 3.3. All images are uniformly resized to 256 × 256 as input.
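A minimal sketch of building the diagonal indicator M from such rectangles is shown below, assuming 4 × 4-pixel cells on the 256 × 256 crop, row-major cell ordering, and rectangles given as (x0, y0, x1, y1); these conventions are illustrative.

import numpy as np

def build_indicator(feature_rects, img_size=256, cell=4):
    """Diagonal indicator M: m_i = 1 if cell i is skin background, 0 otherwise.

    feature_rects: list of (x0, y0, x1, y1) rectangles covering eyes, nose and mouth.
    """
    grid = img_size // cell
    m = np.ones(grid * grid)
    for i in range(grid * grid):
        # Center of cell i in image coordinates (row-major ordering assumed).
        cy = (i // grid) * cell + cell / 2.0
        cx = (i % grid) * cell + cell / 2.0
        if any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in feature_rects):
            m[i] = 0.0        # the cell lies inside a facial-feature rectangle
    return np.diag(m)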
4.1. Experiment I: Face Parsing
We first test the proposed parsing algorithm on FACES. FACES images are challenging in terms of the large range of ages and the variation in expressions. The proposed algorithm is designed to be robust against these variations. Note that eyebrows are not included in our indicator matrix M, so they are not expected components in parsing, although they are visually salient on the face. To demonstrate the effectiveness of the transformation matrix T, we also report the performance obtained by applying LRR to the facial feature representation matrix without T. The parsing results are shown in Fig. 3. Clearly, the transformation matrix T enhances the capability to learn discriminative features, which helps distinguish the facial components from the rest of the face region. The LFPW dataset is more challenging because its faces contain more variations, such as pose, illumination change and occlusion. As mentioned, the proposed algorithm is detection-based: once facial components are occluded by other objects and are not visible in the face region, our method cannot predict their locations. This is reasonable in real-world application scenarios, so we do not count it as a failure. The parsing results are illustrated in Fig. 3.
4.2. Experiment II: Landmark Detection
The five-point landmark detection is validated by applying the method in Sec. 3.4 to the derived parsing maps. The detected landmarks are shown in Fig. 4. Quantitative performance is measured with the average detection error
\mathrm{err} = \sqrt{(x - x')^2 + (y - y')^2} \,/\, l \qquad (4)
where (x, y) is the ground-truth location and (x', y') the detected location of a landmark point, and l is the width of the face bounding box. This is the same criterion used in [15]. If an error is larger than 5%, the point is counted as a failed detection. The failure rate is also reported.
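In code, the error and failure-rate computation of Eq. (4) amounts to the following small routine; the 5% threshold is as stated above.

import numpy as np

def detection_error(gt, pred, box_width):
    """Mean detection error of Eq. (4) and the 5% failure rate.

    gt, pred: (n_points, 2) arrays of ground-truth and detected (x, y) locations;
    box_width: width l of the face bounding box.
    """
    err = np.linalg.norm(gt - pred, axis=1) / box_width
    return err.mean(), (err > 0.05).mean()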
For FACES, apart from the 1000 training images, the remaining 1052 images in the dataset are used for testing. For LFPW, we select 500 non-occluded images as testing samples. The comparison results are shown in Fig. 2.
In the experiments, the method of [15] achieves the best performance due to its encoding of high-level information. Without such configuration and fine-tuning, the proposed algorithm still achieves favorable results on the benchmark databases. The merits of the algorithm are its compact structure and simple training procedure, and its effectiveness is validated through the experiments.
Fig. 2. Comparison of average error rates (%) and failure rates (%) for LE, RE, N, LM, RM and their average: on the LFPW dataset against Liang [4], Valstar [5] and Sun [15], and on the FACES dataset against Valstar [5] and Sun [15]. The average rates are shown in the last column of each plot. Our algorithm achieves comparable results on both datasets.
Fig. 4. Landmark detection examples. The top row contains images from FACES and the bottom row contains images from LFPW.
Fig. 3. Parsing map examples. The first row contains the original input faces with the face regions localized by the cascade face detector. The second row contains the parsing maps obtained without the transformation matrix T. The last row shows the parsing maps generated by the proposed algorithm. The parsing maps without T are polluted by unrelated pixels, whereas the proposed method detects more of the facial component regions.
5. CONCLUSION
In this paper, we proposed a novel face parsing algorithm in which the facial features are segmented by modeling them as sparse noise via low-rank matrix decomposition. To improve parsing accuracy, a learned linear transformation matrix T is applied to the facial image representation matrix. With the generated parsing maps, we further extended the work to facial landmark detection. The proposed algorithm was tested on two benchmark datasets and demonstrated competitive performance compared to state-of-the-art techniques.
6. REFERENCES
[1] Guodong Guo, Rui Guo, and Xin Li, “Facial expression recognition influenced by human aging,” IEEE Trans. Affective Computing, vol. 4, no. 3, pp. 291–298, 2013.
[2] Brian Amberg, Andrew Blake, Andrew W. Fitzgibbon, Sami
Romdhani, and Thomas Vetter, “Reconstructing high quality
face-surfaces using model based stereo,” in IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.
[3] Xiaogang Wang and Xiaoou Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell.,
pp. 1955–1967, 2009.
[4] Lin Liang, Rong Xiao, Fang Wen, and Jian Sun, “Face alignment via component-based discriminative search,” in ECCV (2), vol. 5303 of Lecture Notes in Computer Science, pp. 72–85, Springer, 2008.
[5] Michel Valstar, Brais Martinez, Xavier Binefa, and Maja Pantic, “Facial point detection using boosted regression and graph
models,” in IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2010. June 2010, pp. 2729–2736, IEEE
Computer Society.
[6] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong
Yu, and Yi Ma, “Robust recovery of subspace structures by
low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184,
2013.
[7] Brandon M. Smith, Li Zhang, Jonathan Brandt, Zhe Lin, and Jianchao Yang, “Exemplar-based face parsing,” in CVPR, pp. 3484–3491, IEEE, 2013.
[8] Lin Liang, Fang Wen, Xiaoou Tang, and Ying-Qing Xu, “An integrated model for accurate shape alignment,” in ECCV (4), vol. 3954 of Lecture Notes in Computer Science, pp. 333–346, Springer, 2006.
[9] Xiaoou Tang, Xiaogang Wang, and Ping Luo, “Hierarchical face parsing via deep learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2480–2487.
[10] Xiaohui Shen and Ying Wu, “A unified approach to salient object detection via low rank matrix recovery,” in 2012
IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 853–860.
[11] Congyan Lang, Guangcan Liu, Jian Yu, and Shuicheng Yan,
“Saliency detection by multitask sparsity pursuit,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1327–1338,
2012.
[12] Paul Viola and Michael J. Jones, “Robust real-time face detection,” Int. J. Comput. Vision, vol. 57, no. 2, pp. 137–154, May
2004.
[13] Eero P. Simoncelli and William T. Freeman, “The steerable pyramid: A flexible architecture for multi-scale derivative
computation,” in IEEE ICIP, 1995, pp. 444–447.
[14] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via
nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp.
471–501, Aug. 2010.
[15] Yi Sun, Xiaogang Wang, and Xiaoou Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the 2013 IEEE Conference on Computer Vision and
Pattern Recognition, 2013, CVPR ’13, pp. 3476–3483.
[16] Tilke Judd, Krista A. Ehinger, Frédo Durand, and Antonio Torralba, “Learning to predict where humans look,” in ICCV, pp. 2106–2113, IEEE, 2009.
[17] Tim Ellis, Ahmed Abbood, and Beatrice Brillault, “Ellipse
detection and matching with uncertainty,” in Proceedings of
the British Machine Vision Conference. 1991, pp. 18.1–18.9,
BMVA Press.
[18] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar,
“Localizing parts of faces using a consensus of exemplars,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2011, CVPR ’11, pp. 545–552.