Feature Selection and Dimensionality Reduction Using Principal Feature
Analysis
Ira Cohen, Thomas S. Huang
Abstract
Dimensionality reduction of a feature set is a common
preprocessing step used for pattern recognition and
classification applications. It has been employed in
compression schemes as well. Principal component
analysis is one of the popular methods used, and can be
shown to be optimal in some sense. However, it has the
disadvantage that measurements of all of the original
features are needed for the analysis. This paper proposes
a novel method for dimensionality reduction of a feature
set by choosing a subset of the original features that
contains most of the essential information, using the same
criteria as the PCA. We call this method Principal
Feature Analysis. We successfully applied the method for
choosing principal motion features useful for face
tracking and for dimensionality reduction of a face
database which can be used as an initial step before face
recognition. We also present results on facial image
reconstruction and recognition using the selected principal
features, and compare them to those obtained using PCA and
the method of [5].
1. Introduction
In many real world problems, reducing the dimensionality of
the data is an essential step before any analysis can be
performed. The general
criterion for reducing the dimension is the desire to
preserve most of the relevant information of the original
data according to some optimality criteria. In pattern
recognition and general classification problems, methods
such as Principal Component Analysis (PCA) and Fisher
Linear Discriminant Analysis have been extensively used.
These methods find a linear mapping from the original
feature set to a lower dimensional one.
In some applications it might be desirable to pick a subset
of the original features rather than find some mapping which
uses all of them. The benefits of finding such a subset
include avoiding the cost of computing unnecessary features,
the cost of sensors (in real measurement systems), and the
ability to exclude noisy features while recovering their
information from "clean" ones (for example, tracking points
on a face using only easy-to-track points and inferring the
other points from those few measurements).
Variable selection procedures have been used in different
settings. Among them, the regression area has been
investigated extensively. In [1] a Multi Layer Perceptron
is used for variable selection. In [2], stepwise discriminant
analysis is used to select the variables that serve as inputs
to a neural network performing pattern recognition of
circuitry faults. Other regression techniques for variable selection
are well described in [3]. In contrast to the regression
methods, which lack unified optimality criteria, the
optimality properties of PCA have attracted research on
variable selection methods which are based on PCA [4-7].
As will be shown, these methods have the disadvantage of
either being too computationally expensive or choosing a
subset of features that leaves a lot of redundant information.
In this paper we propose a method which exploits the
structure of the principal components of a feature set to
find a subset of the original feature vector which will still
have the same optimality criteria as Principal Component
Analysis with a minimal number of features.
In Section 2 we review the existing PCA based feature
selection methods and give the basic notation of the problem.
Section 3 describes our method and explains its motivation
and its advantages over other methods. In Section 4 we
present applications of the method on two examples: selection
of the facial feature points which convey most of the
information for tracking algorithms, and dimensionality
reduction by selecting the "principal pixels" of faces in a
frontal face database of different people. We show that the
reconstruction and recognition of the original faces using
those pixels is equivalent to the performance of PCA applied
to the same database of faces.
2. Preliminaries and Notations
We consider a linear transformation of a random vector
X \in \mathbb{R}^n with zero mean and covariance matrix \Sigma_x
to a lower dimensional random vector Y \in \mathbb{R}^q, q < n:

Y = A_q^T X    (1)

where A_q^T A_q = I_q and I_q is the q \times q identity matrix.
Suppose we want to estimate X from Y. The least squares
estimate of X (which is also the minimum mean square error
estimate in the Gaussian case) is given by

\hat{X} = A_q (A_q^T A_q)^{-1} Y = A_q Y    (2)
In Principal Component Analysis, A_q is an n \times q matrix
whose columns are the q orthonormal eigenvectors
corresponding to the q largest eigenvalues of the
covariance matrix \Sigma_x. There are 10 optimal properties for
this choice of the linear transformation [4]. One important
property is the maximization of the "spread" of the points
in the lower dimensional space, which means that the
points in the transformed space are kept as far apart as
possible, thereby retaining the variation of the
original space. This property motivates the
use of PCA in classification problems, since it means that
in most cases we keep the projected features as far
away from each other as possible, and thus have a lower
probability of error. Another important property is the
minimization of the mean square error between the
predicted data and the original data. This property is useful
for applications involving prediction and lossy
compression of the data.
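As a concrete illustration of equations (1) and (2), a minimal sketch in Python with numpy (our own code, not from the paper; the function names and the row-wise data layout are assumptions) of the PCA projection and its least squares reconstruction:

```python
import numpy as np

def pca_basis(X, q):
    """Return the n x q matrix A_q whose columns are the eigenvectors of the
    sample covariance matrix corresponding to the q largest eigenvalues."""
    # X is an (N, n) array holding N samples of the n-dimensional feature vector
    cov = np.cov(X, rowvar=False)           # n x n sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort descending
    return eigvecs[:, order[:q]]            # A_q, with A_q^T A_q = I_q

def project_and_reconstruct(X, A_q):
    """Project onto the q-dimensional subspace (eq. 1) and reconstruct (eq. 2)."""
    Y = X @ A_q                                            # Y = A_q^T x, one sample per row
    X_hat = Y @ np.linalg.inv(A_q.T @ A_q) @ A_q.T         # reduces to Y @ A_q.T since A_q^T A_q = I_q
    return Y, X_hat
```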
Now, suppose we want to choose a subset of the original
variables/features of the random vector X. This can be
viewed as a linear transformation of X using a
transformation matrix

A_k = \begin{bmatrix} I_q \\ [0]_{(n-q) \times q} \end{bmatrix}    (3)

or any matrix obtained by permuting the rows of A_k.
Several methods have been proposed to find an 'optimal' A_k.
Without loss of generality, let us consider the
transformation matrix A_k as given above; the corresponding
covariance matrix of X is then partitioned as

\Sigma = \begin{bmatrix} \{\Sigma_{11}\}_{q \times q} & \{\Sigma_{12}\}_{q \times (n-q)} \\ \{\Sigma_{21}\}_{(n-q) \times q} & \{\Sigma_{22}\}_{(n-q) \times (n-q)} \end{bmatrix}    (4)
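For illustration, a small sketch (ours; the helper name is hypothetical) of the selection matrix of equation (3), up to a row permutation:

```python
import numpy as np

def subset_selection_matrix(n, subset):
    """Build an n x q matrix A_k of the form in eq. (3) (up to a row permutation)
    so that A_k^T X returns exactly the features indexed by `subset`."""
    q = len(subset)
    A_k = np.zeros((n, q))
    A_k[subset, np.arange(q)] = 1.0   # one 1 per column, at the selected feature's row
    return A_k

# Example: picking features 0 and 2 of a 4-dimensional x
# x = np.array([1.0, 2.0, 3.0, 4.0]); subset_selection_matrix(4, [0, 2]).T @ x -> [1., 3.]
```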
In [4] it was shown that it is not possible to satisfy all of
the optimality properties of PCA for the same subset.
Finding the subset which maximizes

|\Sigma_Y| = |\Sigma_{11}|    (5)

is equivalent to maximization of the "spread" of the points
in the lower dimensional space, thus retaining the
variation of the original data.
Minimizing the mean square prediction error is equivalent
to minimizing the trace of

\Sigma_{22|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}    (6)

This can be seen since the retained variability of a subset
can be measured using

\text{Retained Variability} = \left( 1 - \frac{\mathrm{trace}(\Sigma_{22|1})}{\sum_{i=1}^{n} \sigma_i^2} \right) \times 100\%    (7)

where \sigma_i is the standard deviation of the i'th feature.
This method is very appealing since it satisfies well
defined optimality properties. Its shortcoming is the
complexity of finding the subset: it is not computationally
feasible to find this subset for a large feature vector,
since all \binom{n}{q} possible combinations have to be
examined. For example, finding a set of 10 variables out of
20 involves computing one of the measures above for
\binom{20}{10} = 184756 possible combinations. Another
method, proposed in [5], uses the Principal Components
proposed in [5] uses the Principal Components
themselves. The coefficient of each principal component
can give an insight to the effect of each variable on that
axes of the transformed space. If the i’th coefficient of one
of the Principal Component is high compared to the
others, it implies that the xi element of X is very dominant
in that axes/PC. By choosing the variables corresponding
to the highest coefficients of each of the first q PC’s, we
maintain a good estimate of the same properties as the PC
transformation. This method is a very intuitive method,
and does not involve heavy computations, but since it
considers each PC independently, variables with the same
information might be chosen causing a lot of redundant
information in the obtained subset. This method was
effectively used in applications where the PC coefficients
are discretized based on the highest coefficient values. An
example of this can be seen in [8], where the authors
project a feature set (Optical Flow of lip tracking) to a
lower dimensional space using PCA, but in fact use a
simple linear combination of the original feature
determined by setting the highest coefficients of the
chosen Principal Components to 1 corresponding to just
a few of the features. Another method proposed in [6] and
[7] chooses a subset of size q by computing its PC
projection to a smaller dimensional space, and minimizing
measure given using the Procrustes Analysis [9]. This
method helps reduce the redundancy of information, but
again involves high computations since many
combinations of subsets are explored.
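For comparison with the method introduced in the next section, a minimal sketch (our reading of the largest-coefficient heuristic of [5] described above; names are hypothetical) of picking one variable per retained principal component:

```python
import numpy as np

def largest_coefficient_selection(A, q):
    """Pick, for each of the first q principal components (columns of A),
    the feature index with the largest absolute coefficient."""
    selected = []
    for j in range(q):
        order = np.argsort(-np.abs(A[:, j]))  # features ranked by |coefficient| on PC j
        for i in order:                       # skip features already chosen
            if i not in selected:
                selected.append(int(i))
                break
    return selected  # may still carry redundant information, as discussed above
```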
In our method, we too exploit the information which can
be inferred from the PC coefficients to obtain the optimal
subset of features, but unlike the method in [5], we use all
of the PC's together to gain a better insight into the
structure of our original features, so we can choose
variables without redundancy of information. In the next
section we describe this method.

3. Principal Feature Analysis

3.1 Motivation
To explain the motivation behind the analysis, let us
observe a very simple example. Let
X = [x_1\; x_2\; x_3\; x_4]^T be a zero mean normally
distributed random feature vector. The variables are
constructed in the following manner:
(a) x_1, x_2 are statistically independent
(b) x_3 = x_1 + n_1
(c) x_4 = x_1 + x_2 + n_2,
where n_1, n_2 are zero mean normally distributed random
variables with much lower variance than x_1 and x_2.
Suppose we want to choose a subset of variables from the
feature vector. Knowing the statistics, it is quite evident
that choosing x2 and either x1 or x3 would give a very good
representation of the entire data, since x1 and x3 are very
highly correlated, and x4 has a mixture of information
from both x2 and the set {x1,x3}. This information can be
extracted from the Principal Components of the data. To
see this we compute the Principal components, and see
that the first two PC’s retain 99.8% of the variability in
the data. Let

A_2 = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ a_{41} & a_{42} \end{bmatrix}
be the matrix whose columns are the first two Principal
components corresponding to the highest eigenvalues of
the covariance matrix. The coefficients of the row vectors
of this matrix can be viewed as the weights of a single
feature on the lower dimensional space, i.e., if a_ij is high,
the i'th feature has a large effect on the projection of the
data onto the j'th lower dimensional axis. If we look at the
structure of these vectors with respect to the other row
vectors, we can see that they will be very similar for the
features which are highly correlated, and very different for
features which are not correlated. Figure 1 shows the plot
of the four row vectors of this example.
As can be seen from the figure, the row vectors
corresponding to x1 and x3 are very near, the vector
corresponding to x2 is far from them, and the vector
corresponding to x4 is somewhere in between. Knowing
that we have chosen a 2-Dimensional subspace, if we
cluster these vectors to 2 clusters (say {x2,x4}, and {x1,x3})
we could choose one variable from each cluster using
some metric to decide on the best feature. If we choose
the vectors in each cluster that are farthest apart from the
other cluster, the natural choice would be to choose x2 and
x1. Choosing this combination retains 98% of the
variability in the data, computed using equation (7).
Choosing x_4, however, will not yield a good representation of
the data (90% of the variability). Although x_4 holds
information on all of the variables, it has little
discriminating power in the subspace, and choosing it
together with any of the others will result in a lot of
redundant information. This simple example shows that the
structure of the row vectors corresponding to the first q
PC's carries information on the dependencies between
features in the lower dimensional space. The number of
features chosen should be at least the dimension of the
subspace q, and it has been shown that a slightly higher
number of features should be taken in order to achieve the
same performance as the q dimensional PCA [4,5].
Figure 1: Plot of the PC coefficients corresponding to the four
variables. The X-axis is the first PC, the Y-axis the second PC.
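The toy example above can be reproduced with a short simulation. The sketch below (ours; the unit variances and the noise standard deviation of 0.1 are arbitrary choices consistent with the description) generates the four variables, computes the first two principal components, and prints the row vectors that Figure 1 plots:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
x1 = rng.normal(0.0, 1.0, N)
x2 = rng.normal(0.0, 1.0, N)
x3 = x1 + rng.normal(0.0, 0.1, N)        # highly correlated with x1
x4 = x1 + x2 + rng.normal(0.0, 0.1, N)   # mixes x1 and x2
X = np.column_stack([x1, x2, x3, x4])

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
A2 = eigvecs[:, order[:2]]               # first two principal components

print("variability retained by 2 PCs: %.1f%%"
      % (100 * eigvals[order[:2]].sum() / eigvals.sum()))
print("row vectors |V_i| (one per feature):")
print(np.abs(A2))                        # rows for x1 and x3 come out nearly identical
```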
3.2 Principal Feature Analysis: The Algorithm
Let X be a zero mean n-dimensional random feature
vector. Let \Sigma be the covariance matrix of X (which could
be in correlation form as well). Let A be a matrix whose
columns are the orthonormal eigenvectors of \Sigma,
computed using the singular value decomposition of \Sigma:

\Sigma = A \Lambda A^T    (8)

where \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) with \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n, and A^T A = I_n.

Let A_q be the first q columns of A and let
V_1, V_2, \ldots, V_n \in \mathbb{R}^q be the rows of the matrix A_q.
The vector V_i corresponds to the i'th feature (variable) in
the vector X, and the coefficients of V_i correspond to the
weights of that feature on each axis of the subspace.
Features that are highly correlated will have similar
absolute value weight vectors (absolute value is necessary
since changing the sign of one variable changes the signs
of the corresponding weights but has no statistical
significance [5]). In order to find the best subset we will
use the structure of these rows to first find the features
which are highly correlated to each other and then choose
from each group of correlated features the one which will
represent that group optimally in terms of high spread in
the lower dimension, reconstruction and insensitivity to
noise. The algorithm can be summarized in the following
five steps:
Step 1 Compute the Sample Covariance matrix, or use
the true covariance matrix if it is available. In
some cases it is preferable to use the
Correlation matrix instead of the Covariance
matrix. The Correlation matrix is defined as the
n \times n matrix whose (i, j)'th entry is

\rho_{ij} = \frac{E[x_i x_j]}{\sqrt{E[x_i^2] E[x_j^2]}}    (9)

This representation is preferred in cases where
the features have very different variances from
each other, since using the regular covariance form
would cause the PCA to put very heavy weights on
the features with the highest variances. See [5]
for more details.
Step 2 Compute the Principal components and
eigenvalues of the Covariance/Correlation matrix
as defined in equation (8).
Step 3 Choose the subspace dimension q and construct
the matrix Aq from A. This can be chosen by
deciding how much of the variability of the data
is desired to be retained. The retained variability
can be computed using:
\text{Variability Retained} = \frac{\sum_{i=1}^{q} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \times 100\%    (10)

Step 4 Cluster the vectors |V_1|, |V_2|, \ldots, |V_n| \in \mathbb{R}^q to
p \geq q clusters using the K-Means algorithm [15].
The distance measure used for the K-Means
algorithm is the Euclidean distance. The vectors
are clustered into p clusters and the mean of each
cluster is computed. This is an iterative stage
which repeats until the p clusters are found and
do not change. The reason for choosing p greater
than q in some cases is that if the same retained
variability as the PCA is desired, a slightly
higher number of features is needed (usually 1-5
more are enough).
Step 5 In each cluster, find the corresponding vector V_i
which is closest to the mean of the cluster.
Choose the corresponding feature x_i as a
principal feature. This yields the choice of p
features. The reason for choosing the vector
nearest to the mean is twofold. This feature can
be thought of as the central feature of that
cluster, the one most dominant in it, and the one
which holds the least redundant information about
features in other clusters. Thus it satisfies both
of the properties we wanted to achieve: a large
"spread" in the lower dimensional space, and good
representation of the original data. Note that if a
cluster has just two components in it, the chosen
feature will be the one which has the highest
variance of the coefficients, since this means that
it has better separability power in the subspace.
By choosing the principal features using this algorithm,
we choose the subset that represents the entire feature set
well, both in terms of retaining the variations in the feature
space (through the clustering procedure) and in terms of
keeping the prediction error at a minimum (by choosing in
each cluster the feature whose vector is closest to the mean
of the cluster).
The complexity of the algorithm is of the order of
performing the PCA, since the K-Means algorithm is
applied to just n vectors and will normally converge after
very few iterations. This is somewhat more expensive
than the method proposed in [5], but it takes into account
the information given by the entire set of chosen Principal
Components and thus better captures the interdependencies
between the features. The method does not optimize the
criteria given in [4], but it approximates them and allows
application to large sets of features, a task which is
impossible for the method in [4].
An important byproduct of this method is the ability to
predict the original features, with a known accuracy, using
their LS estimate given by

\hat{X} = \begin{bmatrix} I_q \\ \Sigma_{21} \Sigma_{11}^{-1} \end{bmatrix} \tilde{X}    (11)

where \tilde{X} is the chosen subset of features.
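A minimal end-to-end sketch of the five steps (in Python with numpy and scikit-learn's KMeans; the function names, the default retained variability, and the choice of p = q + 2 are our assumptions, not prescribed by the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_feature_analysis(X, variability=0.9, extra=2, use_correlation=True):
    """Steps 1-5: return the indices of the selected principal features."""
    # Step 1: covariance or correlation matrix of the data
    S = np.corrcoef(X, rowvar=False) if use_correlation else np.cov(X, rowvar=False)
    # Step 2: principal components and eigenvalues (eq. 8)
    eigvals, A = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, A = eigvals[order], A[:, order]
    # Step 3: choose q so that the retained variability (eq. 10) reaches the target
    q = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variability)) + 1
    V = np.abs(A[:, :q])                  # rows |V_1|, ..., |V_n| in R^q
    # Step 4: cluster the rows into p >= q clusters with Euclidean K-Means
    p = q + extra                         # a few extra features, as suggested above
    km = KMeans(n_clusters=p, n_init=10, random_state=0).fit(V)
    # Step 5: in each cluster pick the feature whose row is closest to the cluster mean
    selected = []
    for c in range(p):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(V[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected)

def predict_remaining(cov, selected, X_tilde):
    """LS prediction of the full feature vector from the chosen subset (eq. 11)."""
    rest = [i for i in range(cov.shape[0]) if i not in selected]
    S11 = cov[np.ix_(selected, selected)]
    S21 = cov[np.ix_(rest, selected)]
    W = np.vstack([np.eye(len(selected)), S21 @ np.linalg.inv(S11)])
    return X_tilde @ W.T                  # output columns: selected features first, then the rest
```

The selected indices can then be used directly as the reduced feature set, while predict_remaining gives the LS reconstruction of equation (11) for the features that were not measured.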
4. Experiments
To show the results of the Principal Feature Analysis, we
apply the method to two examples of possible
applications.
4.1 Facial Motion Feature Points Selection
The first is the selection of the important points which should
be tracked on a human face in order to account for its
non-rigid motion. This is a classic example of the need for
feature selection, since it can be very expensive, and
perhaps impossible, to track many points on the face
reliably. Methods that have been proposed using optical flow
[11],[8] or patches [12] could benefit from the selection of
fewer meaningful features, which would decrease the error
and the time of the tracking. In
addition, prediction of the remaining features can be
performed using the LS estimation. This can be used for
driving an animated figure based on the tracking results.
Since the tracking of the points in the development stage
has to be accurate, we used markers located on many
facial points. Tracking of these markers is done
automatically using template matching techniques, and the
results are checked manually for error correction. Thus we
have reliable tracking results for the entire video sequence
(60 seconds at 30 frames per second) of human facial motion
performing several normal actions: smiling, frowning,
acting surprised, and talking. We estimate the 2-D non-rigid
facial motion vector of each feature point over the entire
sequence, after accounting for the global head motion using
stationary points (the nose tip). The images in
Figure 2 demonstrate some facial expressions that appear
in the video sequence. In order to avoid singularity of the
covariance matrix, time periods that have no motion at all
are not taken into account. There are a total of 40 facial
points being tracked. For the Principal Feature Analysis,
the points are split into two groups, upper face (eyes and
above) and lower face. Each point is represented by its
horizontal and vertical directions, and therefore the actual
number of features we have is 80. We
compute the Correlation matrix of each group after
subtracting its Mean, and apply the Principal feature
analysis to choose the important features (points and
direction of motion), while retaining 90% of the
variability in the data. Figure 3 shows the results of the
analysis. The chosen features are marked by arrows
displaying the principal direction chosen for that feature
point. It can be seen that the chosen features correspond to
physical based models of the face, i.e., vertical motion
was chosen for the middle lip point, with more features
chosen around the lips then other lower facial regions.
Vertical motion features were chosen for the upper part of
the face (with the inner eyebrows chosen to represent the
horizontal motion which appeared in the original
sequence). This implies that much of the lower-face
region’s motion can be inferred using mainly lip motion
tracking (an easier task from the practical point of view).
In the upper part of the face, points on the eyebrows were
chosen, and in the vertical direction mostly, which is in
agreement with the physical models. It can also be seen
that less points are needed for tracking in the upper part of
the face (7 principal motion points) then the lower part of
the face (9 principal motion points) since there are fewer
degrees of freedom in the motion of the upper part of the
face. This analysis is comparable to the classic facial
motion studies made by Ekman [16], in which facial
movements displaying emotions were mapped to Action
Units corresponding to facial muscle movements. The
example shows that Principal Feature Analysis can model a
difficult physical phenomenon such as facial motion and
reduce the complexity of existing algorithms by removing
the need to measure all of the features. This is in contrast
to PCA, which needs measurements of the entire original
motion vector to do the same modeling.
4.2 Dimensionality Reduction of Faces
The second experiment will show the performance of
Principal Feature analysis for reducing the dimensionality
of a frontal face image. We will also compare face
recognition results using Principal Feature analysis, PCA
and the method described in [5]. Face recognition
schemes have used PCA to reduce the dimensionality of
the feature vector, which is just the original image of a
person’s face put into vector form by concatenating the
rows [10]. Another space reduction method used was
Fisher Linear Discriminant [13]. These methods were
used because of the realization that discrimination in the
high dimension suffers from ‘the curse of dimensionality’
problem and is computationally very expensive. We used
a set of 93 facial images of different people of size 33x32
taken from the Purdue Face Database [14]. There were 51
male and 42 female subjects, with and without glasses,
with and without beards, taken under the same strict
lighting conditions and with a neutral facial expression. To
make the problem somewhat more realistic, the images
were not perfectly aligned as can be seen in the examples
shown in Figure 4 (a). The result of the Principal Feature
Analysis can be seen in Figure 4 (1). It shows the pixels
that were chosen highlighted on one of the faces from the
database. There were 56 pixels chosen (out of 1056),
accounting for 90% of the variability. The loss in terms of
variability using the same number of Principal
components was 5%. This is a slightly higher number than
expected, but it is caused by the imperfection of the data.
Figure 4 (b) shows the reconstruction of some of the
images using those 56 pixels and the prediction
transformation matrix given by Equation (11). It can be
seen that the reconstruction appears accurate, and even
subjects with glasses (10 out of the entire set), or beards
(2 out of the set) were reconstructed well. Figure 4(c)
shows the reconstruction using the method in [5], and it
can be seen that there is visible loss of details, especially
around the mouth of the subjects. Table 1 shows the
mean-square error measured over the entire set for the 3
methods. Again, as expected, the Principal feature method
outperforms the method in [5], and is comparable
(although higher) then the PCA analysis using 56 PC’s,
and is exactly the same as the error using 45 PC’s. As can
be seen in Figure 4, this difference is not apparent in the
reconstructed images.
As a second part of the experiment, we used the
principal feature analysis to do recognition on the faces in
the database. We used a second set of the faces of the
same people as the training set (93 people). The faces in
the second set were smiling. We used the Nearest-Neighbor
classifier. As can be seen in Table 1, the
recognition rate using Principal Feature Analysis is the
same as the recognition rate using PCA with 56 PC’s
(94%), and slightly higher than PCA using 45 PC’s
(92%). In contrast to this, the method of [5] performed
poorly on the test set (78%). Checking the faces where the
errors were made showed that the Principal Feature
Analysis made the errors for the same faces as PCA, and
gave the same wrong classification results for all but one
of these errors. This strengthens the claim that the method
approximates well the Principal Component Analysis. The
advantage of using Principal Feature Analysis over
PCA in this case is that although we are measuring all of
the features anyway, the transformation to the lower
dimensional space involves no computation at all, a fact
which can speed up face recognition without loss of
accuracy.
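A minimal sketch (ours, with hypothetical array names) of the nearest-neighbor recognition step on the selected principal pixels; with Principal Feature Analysis the "projection" amounts to indexing the chosen pixels:

```python
import numpy as np

def nearest_neighbor_recognize(train_images, train_labels, test_images, selected_pixels):
    """Classify each test image by the closest training image,
    comparing only the selected principal pixels (no projection needed)."""
    train = train_images[:, selected_pixels]   # (num_train, p)
    test = test_images[:, selected_pixels]     # (num_test, p)
    # Euclidean distance between every test image and every training image
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return train_labels[np.argmin(d, axis=1)]
```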
5. Summary
In this paper we have proposed a novel method for
dimensionality reduction using Principal Feature Analysis.
By exploiting the structure of the principal components of
a set of features, we can choose the principal features
which retain most of the information both in the sense of
maximum variability of the features in the lower
dimensional space and in the sense of minimizing the
reconstruction error. We have shown that this method can
give results equivalent to PCA, or with a comparably small
difference. The method should be used in cases where the
cost of using all of the features can be very high, whether
in terms of computational cost, measurement cost, or
reliability of the measurements. Its most natural application
is in the development stage, where it can be used to pick a
small feature set out of a large set of candidate features, as
shown in the first experiment. It can also be used for
reconstruction of the original large set of features, which
in the first experiment could be used to drive an animated
face, a task which normally needs many facial points as
input. But, as we have shown in the second example, it can
also be used in a broader span of applications, such as the
dimensionality reduction used for appearance-based face
recognition.
6. References
[1] Lisboa, P.J.G., Mehri-Dehnavi, R., “Sensitivity Methods
for Variable Selection Using the MLP”, International Workshop
on Neural Networks for Identification, Control, Robotics and
Signal/Image, 1996, pp. 330-338.
[2] Lin, T.-S., Meador, J., “Statistical Feature Extraction and
Selection for IC Test Pattern Analysis”, Circuits and Systems,
1992, pp.391-394 vol 1.
[3] Hocking, R.R, “Development in Linear Regression
Methodology: 1959-1982”, Technometrics, 1983, pp. 219-249
vol. 25
[4] McCabe, G.P., “Principal Variables”, Technometrics, 1984,
pp. 137-134 vol.26.
[5] Jolliffe, I.T., Principal Component Analysis, Springer-Verlag, New York, 1986.
[6] Krzanowski, W.J., “Selection of Variables to Preserve
Multivariate Data Structure, Using Principal Component
Analysis”, Applied Statistics- Journal of the Royal Statistical
Society Series C, 1987, pp. 22-33 vol. 36
[7] Krzanowski, W.J., “A Stopping Rule for Structure-Preserving Variable Selection”, Statistics and Computing,
March 1996, pp. 51-56 vol. 6.
[8] Mase, K., Pentland, A., “Automatic Lipreading by Optical-Flow Analysis”, Systems & Computers in Japan, 1991, pp. 67-76 vol. 22 no. 6.
[9] Gower, J.C., “Statistical Methods of Comparing Different
Multivariate Analyses of the Same Data”, In Mathematics in the
Archaeological and Historical Sciences (F.R. Hodson, D.G.
Kendall and P. Tautu editors) , University Press, Edinburgh,
1971, pp.138-149
[10] Turk, M.A., Pentland, A., “Face Recognition Using
EigenFaces”, Proceedings of CVPR, 1991, pp. 586-591.
[11] DeCarlo D, Metaxas D, ”Combining Information Using
Hard Constraints”, Proceedings of CVPR, 1999, pp. 132-138
vol. 2.
[12] Tao H, Huang T.S., “Bezier Volume Deformation Model
for Facial Animation and Video Tracking”, International
Workshop on Motion Capture Techniques for Virtual
Environments, 1998, pp. 242-253.
[13] Belhumeur P.N., Hespanha J.P., Kriegman D.J.,
“Eigenfaces vs. Fisherfaces: Recognition Using Class Specific
Linear Projection”, IEEE Trans. On Pattern Analysis & Machine
Intelligence, July 1997, pp. 711-720 vol. 19 no. 7.
[14] A. Martinez and R. Benavente. The AR Face Database.
CVC Technical Report #24, June 1998.
[15] Arabie, P, Hubert, L.J, De Soete, G. editors, Clustering and
Classification, World Scientific, River Edge, NJ, 1998.
[16] Ekman P., Emotion in the Human Face, Cambridge
University Press, New York, 1982.
Figure 2. Examples of Facial Expressions
Figure 3. Principal Motion Features chosen for the facial points. The arrows show the motion direction chosen (Horizontal or Vertical)
Figure 4: (1) Principal pixels shown highlighted on one of the faces. (a) Original face images, (b) Reconstructed images using Principal
Feature Analysis, (c) Reconstructed images using the Largest Coefficient Method [5], (d) Reconstructed images using the first 56 Principal
Components.
Method                                        Mean Square Error   Variability Retained   Recognition Rate
Principal Feature Analysis (56 pixels)                15                  90%                  94%
Largest Coefficient Method [5] (56 pixels)            18                  85%                  78%
PCA using 56 PC's                                     11                  95%                  94%
PCA using 45 PC's                                     15                  90%                  92%

Table 1: Error comparison between the methods