FACIAL FEATURE PARSING AND LANDMARK DETECTION VIA LOW-RANK MATRIX DECOMPOSITION

Rui Guo and Hairong Qi
University of Tennessee, Knoxville
Department of Electrical Engineering and Computer Science
Knoxville, TN 37996, USA
ABSTRACT
Facial feature parsing is an active research topic in image understanding. In this paper, we address the parsing problem by applying low-rank matrix decomposition to facial images. Specifically, the face is modeled as a combination of the skin background, which resides in a low-dimensional subspace, and the salient feature components (e.g., eyes, nose and mouth), which act as sparse noise. Given a face image, feature parsing can then be naturally formulated as sparse noise detection during the recovery of a low-rank matrix. To enhance parsing, a linear transformation matrix is learned to boost discriminative feature extraction. Furthermore, with the derived parsing maps, the algorithm is easily extended to the facial landmark detection task. The effectiveness of the proposed algorithm is evaluated through several experiments on comprehensive datasets.
Index Terms— Facial feature parsing, Low-rank matrix
decomposition, Landmark detection
1. INTRODUCTION
In image understanding, facial feature parsing refers to the
task of segmenting face images into different facial feature
components, e.g., eyes, nose and mouth. The study of facial parsing is attractive because of the important role it plays in multiple applications, including human identity recognition, animation, demographic analysis [1], facial image synthesis [2] and face image sketching [3]. All these applications call for accurate segmentation that is robust to expression, pose and illumination variations. Most existing works accomplish this by localizing landmarks on the face as initial points and then refining them on a pixel-by-pixel basis using classification or regression until the regions of interest are completely segmented. To boost performance, some template matching models [4] and graphical models [5] are assumed to be known a priori.
In this paper, we address the parsing problem from a new
perspective and focus on facial feature detection from the face
image instead of assigning a label to each pixel. Compared to previous methods, this detection-based approach is more efficient since it does not need to train component descriptors on a pixel-by-pixel basis. The facial features are treated as an entire set and can be detected in one pass. In addition, we emphasize that the face alignment required by most existing approaches is not necessary.
Compared to the skin background, the facial features contain discriminative shape and texture information, making them salient on the face. The intuitive idea for parsing is then to separate the salient components from the background. Our algorithm employs the low-rank matrix decomposition method [6], which treats the skin background as a matrix spanning a low-dimensional subspace and the facial features, with their discriminative characteristics, as sparse noise.
Based on the parsing results, we also extend the work to landmark detection. In an extensive evaluation, the proposed detection-based parsing algorithm compares favorably with state-of-the-art techniques. In summary, this work makes contributions in the following aspects:
• This is the first time that low-rank matrix decomposition is introduced to solve the facial feature parsing problem; the proposed detection-based algorithm is original work in this area;
• Compared to the existing parsing methods, our algorithm considers the facial features as a whole and detects them simultaneously in one decomposition process with high accuracy;
• With the parsing results, we can easily extend the work to accomplish facial landmark detection. The high parsing accuracy ensures that the detection results are competitive with the state of the art.
2. RELATED WORK
In general, facial parsing is implemented in two ways: landmark-based approaches and segmentation-based approaches [7]. On one hand, the landmark-based method marks each pixel on the face with a semantic part label based on pixel attributes, and its performance relies heavily on the initialization process. Improved versions of this approach include the Markov Shape Model [8], which considers local line segments and appearance to alleviate the dependency on initialization. However, other problems remain, since such techniques require expensive computation and tend to fail in real-world applications. On the other hand, segmentation-based approaches were proposed with computational efficiency and robustness to pose and expression variations in mind. In [4], a component-based discriminative search algorithm was designed, in which multiple facial component detectors were combined to detect the facial features. In more recent research, since facial features have unique geometric configurations and appearances, sparse matching and deep networks were adopted to detect and correlate the facial features and identify the components to which they belong [9]. However, these methods still have drawbacks. First, it is difficult to give facial features a consistent model; for example, there is no clear shape model that describes the mouth under different expressions, and pose variation may also cause appearance changes in different scenarios. Once model generation fails, the matching process cannot locate the correct components on the face. Second, it is difficult for a single detector to precisely locate all facial components, so we either need to incorporate more detectors or the parsing process fails.
Inspired by the saliency detection work in [10, 11], we model facial parsing as salient feature detection on the face. The motivation is that the facial features, e.g., eyes, nose and mouth, have unique appearances that make them visually salient compared with the skin texture. After applying low-rank matrix decomposition to the feature matrix, the facial feature components, acting as the sparse noise, are readily separated from the recovered skin background. The algorithm is detection-based and needs no high-level prior knowledge or models. This means that if facial components disappear due to occlusion, the algorithm cannot predict the positions of those features.
3. PARSING ALGORITHM
Before we introduce the parsing algorithm in detail, an overview of low-rank matrix decomposition is given in Section 3.1. After that, we describe the proposed feature representation in Section 3.2 and the learning of the transformation matrix in Section 3.3. The landmark detection process is discussed in Section 3.4.
3.1. Low-rank Matrix Representation
Instead of using the spatial information of the image, the face is represented in feature space and denoted as F. We model the facial image as a combination of a low-rank skin background L and facial features acting as sparse noise S. The prototype of matrix decomposition by low-rank representation (LRR) [6] is formulated as
(L^*, S^*) = \arg\min_{L,S} \big( \operatorname{rank}(L) + \lambda \|S\|_0 \big), \quad \text{s.t. } F = L + S \qquad (1)
where L^* and S^* are the optimized estimates of the skin background and the noise residual, respectively, and \|\cdot\|_0 denotes the L0 norm.
Since solving Eq. (1) is NP-hard, it is replaced by its convex surrogate
(L^*, S^*) = \arg\min_{L,S} \big( \|L\|_* + \lambda \|S\|_1 \big), \quad \text{s.t. } F = L + S \qquad (2)
where \|\cdot\|_* denotes the nuclear norm and \|\cdot\|_1 the L1 norm. Solvers of the LRR problem have been proposed in many research articles; we adopt the widely used Robust PCA method [6], which is extendable and flexible in many cases. We are interested in the decomposition result S^*, which contains the extracted parsing map in gray scale: the highlighted pixels in S^* correspond to the segmented facial features.
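For concreteness, the following is a minimal NumPy sketch of solving Eq. (2) with an inexact augmented-Lagrange-multiplier scheme, a standard Robust PCA solver; the default weight λ = 1/√max(D, N), the penalty schedule and the stopping tolerance are common but illustrative choices rather than the exact settings of [6].

import numpy as np

def robust_pca(F, lam=None, tol=1e-7, max_iter=500):
    """Decompose F into a low-rank part L and a sparse part S with F = L + S.

    An inexact-ALM sketch of the convex program in Eq. (2); lam defaults to
    1 / sqrt(max(D, N)), a commonly recommended value.
    """
    D, N = F.shape
    lam = lam or 1.0 / np.sqrt(max(D, N))
    norm_F = np.linalg.norm(F)
    spectral = np.linalg.norm(F, 2)
    Y = F / max(spectral, np.abs(F).max() / lam)   # scaled dual variable
    mu, rho = 1.25 / spectral, 1.5                 # penalty and its growth rate
    S = np.zeros_like(F)
    for _ in range(max_iter):
        # Singular value thresholding: update the low-rank component L.
        U, sig, Vt = np.linalg.svd(F - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Soft thresholding: update the sparse component S.
        R = F - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual ascent step and convergence check on the residual.
        Y = Y + mu * (F - L - S)
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(F - L - S) / norm_F < tol:
            break
    return L, S

The column magnitudes of the returned S can then be mapped back onto the cell grid of Section 3.2 to form the gray-scale parsing map.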
3.2. Facial Image Representation
Given a face image, we first apply the cascade face detector [12] to locate the face region. The detected face region is resized to a uniform size of 256 × 256. To acquire a comprehensive representation of the face, similar to [10], we extract the following multi-modality visual features of the face region (a brief extraction sketch is given after the list):
• Color features in the HSV color space. The hue, saturation and brightness values of the image are used to represent the color information, resulting in 3 color feature maps;
• Texture steerable pyramids [13]. The steerable pyramid filter is adopted to extract texture information; we use the filter responses at 4 orientations and 3 scales, resulting in 12 pyramid maps;
• Gabor wavelet features. Gabor filtering is also performed to extract more detailed texture features; we use Gabor wavelet filters at 6 orientations and 3 scales, yielding a total of 18 wavelet feature maps.
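As an illustration, the color and Gabor maps above could be computed roughly as follows with OpenCV and NumPy; the Gabor kernel size and wavelengths are assumptions, and the 12 steerable pyramid responses, which would come from a dedicated pyramid toolbox, are omitted from this sketch.

import cv2
import numpy as np

def extract_feature_maps(face_bgr):
    """Return per-pixel feature maps for a 256 x 256 face crop (sketch only)."""
    maps = []
    # 3 color maps: hue, saturation and value channels of the HSV image.
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    maps.extend([hsv[..., 0], hsv[..., 1], hsv[..., 2]])
    # 18 Gabor maps: 6 orientations x 3 scales (kernel size and wavelengths assumed).
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    for lambd in (4.0, 8.0, 16.0):                  # 3 scales
        for theta in np.arange(6) * np.pi / 6:      # 6 orientations
            kern = cv2.getGaborKernel((31, 31), 0.56 * lambd, theta, lambd, 0.5, 0)
            maps.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, kern)))
    # The 12 steerable pyramid maps (4 orientations x 3 scales) would be
    # appended here using a pyramid implementation such as [13].
    return maps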
All of these feature maps are properly normalized to reduce cross-feature variations and are stacked to form the feature representation. In the image space, we divide the face region into non-overlapping cells of 4 × 4 pixels. For each cell, the mean value of each of the 33 feature maps is computed, and the 33 mean values are concatenated into a 33 × 1 feature vector f_i, i = 1, ..., 4096, representing the cell. All the feature vectors of the face image form the matrix F = [f_1, f_2, ..., f_N] ∈ R^{D×N}, where D is the dimension of the feature vector (33) and N is the number of cells (4096 in our case). The matrix F is the facial image representation in the feature space.
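Assuming the cells are 4 × 4-pixel blocks tiling the 256 × 256 crop, which is what yields N = 4096, the matrix F can be assembled as in the following sketch; the per-map min–max scaling is one plausible instance of the normalization mentioned above.

import numpy as np

def build_feature_matrix(feature_maps, cell=4):
    """Stack per-pixel maps into F (D x N) by averaging over each cell.

    feature_maps: list of D arrays, each 256 x 256.
    Returns F with D rows (33 in our case) and N = (256 / cell)^2 columns.
    """
    stack = np.stack(feature_maps, axis=0)               # (D, 256, 256)
    D, H, W = stack.shape
    # Normalize each map to [0, 1] to reduce cross-feature variation.
    mins = stack.reshape(D, -1).min(axis=1)[:, None, None]
    maxs = stack.reshape(D, -1).max(axis=1)[:, None, None]
    stack = (stack - mins) / (maxs - mins + 1e-8)
    # Average-pool each map over non-overlapping cell x cell blocks.
    pooled = stack.reshape(D, H // cell, cell, W // cell, cell).mean(axis=(2, 4))
    # Column i of the result is the D-dimensional descriptor f_i of cell i.
    return pooled.reshape(D, -1)                          # (D, N)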
3.3. Learning Process of Linear Transformation Matrix
The matrix of the facial skin background is naturally low-rank in a face image. To boost this characteristic, a linear transformation matrix T is employed, which is learned in the feature space. For each training image, the feature representation is extracted according to Sec. 3.2. A diagonal matrix M = diag(m_1, m_2, ..., m_N) is generated from the ground truth to indicate whether each feature vector belongs to the skin background (value 1) or not (value 0). With this configuration, the transformation matrix T is learned by solving the optimization problem
T^* = \arg\min_{T} \Big( \frac{1}{K} \sum_{k=1}^{K} \| T F_k M_k \|_* - \gamma \| T \|_* \Big), \quad \text{s.t. } \| T \|_2 = c \qquad (3)
where K is the total number of training images and k is the image index. To avoid a trivial solution, we add a constraint that keeps the L2 norm of T at a small constant (c = 2 in our experiments). To solve Eq. (3), we adopt the gradient descent method [10, 14]. The role of the indicator matrix M is to zero out all feature vectors that do not belong to the skin background; therefore, the term T F M forces T to learn the features of the background while keeping the product in low-rank form. The learned T^* is used for all testing images. In the testing stage, we multiply T^* with the facial image representation F and derive the parsing map by applying Eq. (2) to T^* F.
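A rough sketch of such a projected subgradient descent is given below; it uses the standard U V^T subgradient of the nuclear norm, and the learning rate, iteration count, identity initialization and the use of the Frobenius norm for the ||T||_2 = c projection are illustrative choices.

import numpy as np

def nuclear_subgrad(X):
    """Subgradient of the nuclear norm ||X||_* at X, i.e. U @ Vt from its SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def learn_transform(F_list, M_list, gamma=0.1, c=2.0, lr=1e-3, n_iter=200):
    """Projected subgradient descent for the objective in Eq. (3).

    F_list: per-image feature matrices F_k (D x N).
    M_list: per-image diagonal skin indicators M_k (N x N), 1 = skin cell.
    """
    D = F_list[0].shape[0]
    K = len(F_list)
    T = np.eye(D)                      # identity initialization (assumption)
    for _ in range(n_iter):
        # Subgradient of (1/K) * sum_k ||T F_k M_k||_* with respect to T.
        grad = sum(nuclear_subgrad(T @ Fk @ Mk) @ (Fk @ Mk).T
                   for Fk, Mk in zip(F_list, M_list)) / K
        # Subtract the gradient of the gamma * ||T||_* term being maximized.
        grad -= gamma * nuclear_subgrad(T)
        T -= lr * grad
        # Project back onto the norm constraint (Frobenius norm assumed).
        T *= c / np.linalg.norm(T)
    return T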
3.4. Post-processing and Landmark Detection
With the parsing map, our algorithm can be easily extended to
detect landmark points on the face. In our work, we consider five landmarks: left eye center (LE), right eye center (RE), nose tip (N), left mouth corner (LM) and right mouth corner (RM). This setting coincides with [15] and is the one most commonly used for landmark detection.
The landmark points are extracted from the parsing map. We first apply a Gaussian map [16] to the parsing result to enhance the detected features. The kNN algorithm is then used to separate the feature components. To acquire a clear boundary for each feature component, ellipse matching [17] is applied to each class. The eye center points are taken as the centers of the eye classes, and likewise for the nose tip; the mouth corners are boundary points computed from the mouth class.
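For illustration, a simplified sketch of this post-processing is shown below; it substitutes k-means clustering (via scikit-learn) for the kNN separation, omits the Gaussian weighting and ellipse matching of [16, 17], and its threshold, cluster count and cluster-ordering heuristics are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def extract_landmarks(parsing_map, thresh=0.5):
    """Rough five-point landmark extraction from an (H x W) parsing map in [0, 1]."""
    ys, xs = np.nonzero(parsing_map > thresh)
    pts = np.column_stack([xs, ys]).astype(float)
    # Group the highlighted pixels into four components: two eyes, nose, mouth.
    labels = KMeans(n_clusters=4, n_init=10).fit_predict(pts)
    clusters = [pts[labels == i] for i in range(4)]
    # Order clusters top-to-bottom; the two highest are assumed to be the eyes,
    # then the nose, then the mouth.
    clusters.sort(key=lambda c: c[:, 1].mean())
    eyes = sorted(clusters[:2], key=lambda c: c[:, 0].mean())
    nose, mouth = clusters[2], clusters[3]
    return {
        "LE": eyes[0].mean(axis=0), "RE": eyes[1].mean(axis=0),
        "N": nose.mean(axis=0),
        "LM": mouth[mouth[:, 0].argmin()],   # leftmost mouth pixel
        "RM": mouth[mouth[:, 0].argmax()],   # rightmost mouth pixel
    }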
Fig. 1. Hand-labeled points on a FACES image and the generated bounding rectangles used for training.
4. EXPERIMENTS
We conduct two experiments on two challenging datasets
FACES [1] and LFPW [18] to evaluate the effectiveness of
the proposed algorithm. Both datasets come with manually
labeled landmarks. We perform the transformation matrix learning on FACES with 1000 randomly selected images. Each FACES image comes with 80 hand-labeled landmarks over the entire face. We select the relevant points to form bounding rectangles that cover the eye, nose and mouth regions, as shown in Fig. 1. Based on these rectangles, the indicator matrix M is established according to the discussion in Sec. 3.3. All images are uniformly resized to 256 × 256 as input.
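A minimal sketch of building the diagonal indicator M from such rectangles is shown below, assuming 4 × 4-pixel cells on the 256 × 256 crop, row-major cell ordering, and rectangles given as (x0, y0, x1, y1); these conventions are illustrative.

import numpy as np

def build_indicator(feature_rects, img_size=256, cell=4):
    """Diagonal indicator M: m_i = 1 if cell i is skin background, 0 otherwise.

    feature_rects: list of (x0, y0, x1, y1) rectangles covering eyes, nose and mouth.
    """
    grid = img_size // cell
    m = np.ones(grid * grid)
    for i in range(grid * grid):
        # Center of cell i in image coordinates (row-major ordering assumed).
        cy = (i // grid) * cell + cell / 2.0
        cx = (i % grid) * cell + cell / 2.0
        if any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in feature_rects):
            m[i] = 0.0        # the cell lies inside a facial-feature rectangle
    return np.diag(m)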
4.1. Experiment I: Face Parsing
We first test the proposed parsing algorithm on FACES. FACES images are challenging in terms of the large range of ages and the variation in expressions. The proposed algorithm is designed to be robust against these variations. Note that eyebrows are not included in our indicator matrix M, so they are not expected components in parsing, although they are visually salient on the face. To demonstrate the effectiveness of the transformation matrix T, we also report the performance obtained by applying LRR to the facial feature representation matrix without T. The parsing results are shown in Fig. 3. Clearly, the transformation matrix T enhances the capability to learn discriminative features, which helps distinguish the facial components from the rest of the face region. The LFPW dataset is more challenging because its faces contain more variations, such as pose, illumination change and occlusion. As mentioned, the proposed algorithm is detection-based: once facial components are occluded by other objects and are not visible in the face region, our method cannot predict their locations. This is reasonable in real-world application scenarios, so we do not count it as a failure. The parsing results are illustrated in Fig. 3.
4.2. Experiment II: Landmark Detection
The five-point landmark detection is validated by applying the method in Sec. 3.4 to the derived parsing maps. The detected landmarks are shown in Fig. 4. Quantitative performance is measured with the average detection error
\mathrm{err} = \sqrt{(x - x')^2 + (y - y')^2} \,/\, l \qquad (4)
where (x, y) is the ground-truth location and (x', y') the detected location of a landmark point, and l is the width of the face bounding box. This is the same criterion used in [15]. If an error is larger than 5%, the point is counted as a failed detection. The failure rate is also reported.
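In code, the error and failure-rate computation of Eq. (4) amounts to the following small routine; the 5% threshold is as stated above.

import numpy as np

def detection_error(gt, pred, box_width):
    """Mean detection error of Eq. (4) and the 5% failure rate.

    gt, pred: (n_points, 2) arrays of ground-truth and detected (x, y) locations;
    box_width: width l of the face bounding box.
    """
    err = np.linalg.norm(gt - pred, axis=1) / box_width
    return err.mean(), (err > 0.05).mean()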
For FACES, apart from the 1000 training images, the remaining 1052 images in the dataset are used for testing. For LFPW, we select 500 non-occluded images as testing samples. The comparison results are shown in Fig. 2.
In the experiments, the method of [15] achieves the best performance due to its encoding of high-level information. Without such configuration and fine-tuning, the proposed algorithm still achieves favorable results on the benchmark databases. The merits of the algorithm are its compact structure and simple training procedure, and its effectiveness is validated through the experiments.
Fig. 2. Comparison of average error rates (%) and failure rates (%) for LE, RE, N, LM, RM and their average: on the LFPW dataset against Liang [4], Valstar [5] and Sun [15], and on the FACES dataset against Valstar [5] and Sun [15]. The average rates are shown in the last column of each plot. Our algorithm achieves comparable results on both datasets.
Fig. 4. Landmark detection examples. The top row contains images from FACES and the bottom row contains images from LFPW.
Fig. 3. Parsing map examples. The first row contains the original input faces with the face regions localized by the cascade face detector. The second row contains the parsing maps obtained without the transformation matrix T. The last row shows the parsing maps generated by the proposed algorithm. The parsing maps without T are polluted by unrelated pixels, whereas the proposed method detects more of the facial component regions.
5. CONCLUSION
In this paper, we proposed a novel face parsing algorithm in which the facial features are segmented by modeling them as sparse noise via low-rank matrix decomposition. To improve parsing accuracy, a learned linear transformation matrix T is applied to the facial image representation matrix. With the generated parsing maps, we further extended the work to facial landmark detection. The proposed algorithm was tested on two benchmark datasets and demonstrated competitive performance compared to state-of-the-art techniques.
6. REFERENCES
[1] Guodong Guo, Rui Guo, and Xin Li, “Facial expression recognition influenced by human aging,” IEEE Trans. Affective Computing, vol. 4, no. 3, pp. 291–298, 2013.
[2] Brian Amberg, Andrew Blake, Andrew W. Fitzgibbon, Sami
Romdhani, and Thomas Vetter, “Reconstructing high quality
face-surfaces using model based stereo,” in IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.
[3] Xiaogang Wang and Xiaoou Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell.,
pp. 1955–1967, 2009.
[4] Lin Liang, Rong Xiao, Fang Wen, and Jian Sun, “Face alignment via component-based discriminative search,” in ECCV (2), vol. 5303 of Lecture Notes in Computer Science, pp. 72–85, Springer, 2008.
[5] Michel Valstar, Brais Martinez, Xavier Binefa, and Maja Pantic, “Facial point detection using boosted regression and graph
models,” in IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2010. June 2010, pp. 2729–2736, IEEE
Computer Society.
[6] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong
Yu, and Yi Ma, “Robust recovery of subspace structures by
low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184,
2013.
[7] Brandon M. Smith, Li Zhang, Jonathan Brandt, Zhe Lin, and Jianchao Yang, “Exemplar-based face parsing,” in CVPR, pp. 3484–3491, IEEE, 2013.
[8] Lin Liang, Fang Wen, Xiaoou Tang, and Ying-Qing Xu, “An integrated model for accurate shape alignment,” in ECCV (4), vol. 3954 of Lecture Notes in Computer Science, pp. 333–346, Springer, 2006.
[9] Xiaoou Tang, Xiaogang Wang, and Ping Luo, “Hierarchical face parsing via deep learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2480–2487.
[10] Xiaohui Shen and Ying Wu, “A unified approach to salient object detection via low rank matrix recovery,” in 2012
IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 853–860.
[11] Congyan Lang, Guangcan Liu, Jian Yu, and Shuicheng Yan,
“Saliency detection by multitask sparsity pursuit,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1327–1338,
2012.
[12] Paul Viola and Michael J. Jones, “Robust real-time face detection,” Int. J. Comput. Vision, vol. 57, no. 2, pp. 137–154, May
2004.
[13] Eero P. Simoncelli and William T. Freeman, “The steerable pyramid: A flexible architecture for multi-scale derivative
computation,” in IEEE ICIP, 1995, pp. 444–447.
[14] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via
nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp.
471–501, Aug. 2010.
[15] Yi Sun, Xiaogang Wang, and Xiaoou Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the 2013 IEEE Conference on Computer Vision and
Pattern Recognition, 2013, CVPR ’13, pp. 3476–3483.
[16] Tilke Judd, Krista A. Ehinger, Frédo Durand, and Antonio Torralba, “Learning to predict where humans look,” in ICCV, pp. 2106–2113, IEEE, 2009.
[17] Tim Ellis, Ahmed Abbood, and Beatrice Brillault, “Ellipse
detection and matching with uncertainty,” in Proceedings of
the British Machine Vision Conference. 1991, pp. 18.1–18.9,
BMVA Press.
[18] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar,
“Localizing parts of faces using a consensus of exemplars,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2011, CVPR ’11, pp. 545–552.