Feature Selection and Dimensionality Reduction Using Principal Feature Analysis

Ira Cohen, Thomas S. Huang

Abstract

Dimensionality reduction of a feature set is a common preprocessing step used for pattern recognition and classification applications, and it has been employed in compression schemes as well. Principal Component Analysis (PCA) is one of the most popular methods used and can be shown to be optimal in several senses. However, it has the disadvantage that measurements of all of the original features are needed for the analysis. This paper proposes a novel method for dimensionality reduction of a feature set by choosing a subset of the original features that contains most of the essential information, using the same criteria as PCA. We call this method Principal Feature Analysis. We successfully applied the method to choose principal motion features useful for face tracking and to reduce the dimensionality of a face database as an initial step before face recognition. We also present results on facial image reconstruction and recognition using the selected principal features and compare them to the results obtained using PCA and the method of [5].

1. Introduction

In many real-world problems, reducing the dimensionality of the data is an essential step before any analysis can be performed. The general criterion for reducing the dimension is the desire to preserve most of the relevant information of the original data according to some optimality criterion. In pattern recognition and general classification problems, methods such as Principal Component Analysis (PCA) and Fisher Linear Discriminant Analysis have been used extensively. These methods find a linear mapping from the original feature set to a lower-dimensional feature set.

In some applications it is desirable to pick a subset of the original features rather than find a mapping that uses all of them. The benefits of such a subset include avoiding the cost of computing unnecessary features, reducing the cost of sensors in real measurement systems, and excluding noisy features while keeping their information through "clean" features (for example, tracking a face using a few easy-to-track points and inferring the other points from those measurements).

Variable selection procedures have been used in different settings; among them, the regression area has been investigated extensively. In [1] a multi-layer perceptron is used for variable selection. In [2], stepwise discriminant analysis selects the variables that are fed to a neural network performing pattern recognition of circuitry faults. Other regression techniques for variable selection are described in [3]. In contrast to the regression methods, which lack unified optimality criteria, the optimality properties of PCA have attracted research on variable selection methods based on PCA [4-7]. As will be shown, these methods have the disadvantage of either being too computationally expensive or choosing a subset of features that carries a large amount of redundant information. In this paper we propose a method which exploits the structure of the principal components of a feature set to find a subset of the original feature vector that preserves the same optimality criteria as Principal Component Analysis with a minimal number of features.
In Section 2 we review the existing PCA-based feature selection methods and give the basic notation of the problem. Section 3 describes our method and explains its motivation and its advantages over other methods. In Section 4 we present applications of the method on two examples: selection of the facial feature points that convey most of the information for tracking algorithms, and dimensionality reduction by selecting the "principal pixels" of faces in a frontal face database of different people. We show that reconstruction and recognition of the original faces using those pixels is equivalent to the performance of PCA applied to the same database of faces.

2. Preliminaries and Notations

Consider a linear transformation of a zero-mean random vector $X \in \mathbb{R}^n$ with covariance matrix $\Sigma_X$ to a lower-dimensional random vector $Y \in \mathbb{R}^q$, $q < n$:

$Y = A_q^T X$, with $A_q^T A_q = I_q$,   (1)

where $I_q$ is the $q \times q$ identity matrix.

Suppose we want to estimate $X$ from $Y$. The least squares (LS) estimate of $X$ (which is also the minimum mean square error estimate in the Gaussian case) is given by

$\hat{X} = A_q (A_q^T A_q)^{-1} Y = A_q Y$.   (2)

In Principal Component Analysis, $A_q$ is the $n \times q$ matrix whose columns are the $q$ orthonormal eigenvectors corresponding to the $q$ largest eigenvalues of the covariance matrix $\Sigma_X$. There are ten optimal properties of this choice of linear transformation [4]. One important property is the maximization of the "spread" of the points in the lower-dimensional space, meaning that the points in the transformed space are kept as far apart as possible, thereby retaining the variation of the original space. This property motivated the use of PCA in classification problems, since keeping the projected features as far from each other as possible gives a lower probability of error. Another important property is the minimization of the mean square error between the predicted data and the original data, which is useful for applications involving prediction and lossy compression of the data.

Now suppose we want to choose a subset of the original variables/features of the random vector $X$. This can be viewed as a linear transformation of $X$ using the transformation matrix

$A_q = \begin{bmatrix} I_q \\ 0_{(n-q) \times q} \end{bmatrix}$   (3)

or any matrix obtained by permuting its rows. Several methods have been proposed to find an 'optimal' $A_q$ of this form. Without loss of generality, consider the transformation matrix $A_q$ as given above; the corresponding covariance matrix of $X$ can be partitioned as

$\Sigma_X = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$,   (4)

where $\Sigma_{11}$ is $q \times q$, $\Sigma_{12}$ is $q \times (n-q)$, $\Sigma_{21}$ is $(n-q) \times q$, and $\Sigma_{22}$ is $(n-q) \times (n-q)$.

In [4] it was shown that it is not possible to satisfy all of the optimality properties of PCA with the same subset. Finding the subset which maximizes

$|\Sigma_Y| = |\Sigma_{11}|$   (5)

is equivalent to maximizing the "spread" of the points in the lower-dimensional space, thus retaining the variation of the original data. Minimizing the mean square prediction error is equivalent to minimizing the trace of

$\Sigma_{22|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$.   (6)

This can be seen since the retained variability of a subset can be measured by

$\text{Retained Variability} = \left(1 - \frac{\mathrm{trace}(\Sigma_{22|1})}{\sum_{i=1}^{n} \sigma_i^2}\right) \cdot 100\%$,   (7)

where $\sigma_i$ is the standard deviation of the $i$'th feature.

This method is very appealing since it satisfies well-defined optimality properties. Its shortcoming is the complexity of finding the subset: it is not computationally feasible for a large feature vector, since all $\binom{n}{q}$ possible combinations have to be examined. For example, finding a set of 10 variables out of 20 involves computing either one of the measures for 184,756 possible combinations.
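The two criteria above are easy to evaluate for any single candidate subset once the covariance matrix is known; the difficulty lies only in the number of subsets. The following minimal numpy sketch (our own illustration, not code from the paper; the covariance matrix `sigma` and the index set `subset` are assumed inputs) evaluates $|\Sigma_{11}|$ and the retained variability of equation (7) for one candidate subset:

```python
import numpy as np

def evaluate_subset(sigma, subset):
    """Score a candidate feature subset against the PCA-style criteria.

    sigma  : (n, n) covariance matrix of the full feature vector X
    subset : indices of the q features kept (the "measured" features)
    Returns |Sigma_11| (spread criterion, eq. 5) and the retained
    variability in percent (eq. 7).
    """
    n = sigma.shape[0]
    rest = [i for i in range(n) if i not in subset]

    s11 = sigma[np.ix_(subset, subset)]      # q x q block
    s12 = sigma[np.ix_(subset, rest)]        # q x (n-q) block
    s21 = sigma[np.ix_(rest, subset)]        # (n-q) x q block
    s22 = sigma[np.ix_(rest, rest)]          # (n-q) x (n-q) block

    # Conditional covariance of the discarded features given the kept ones (eq. 6)
    s22_1 = s22 - s21 @ np.linalg.inv(s11) @ s12

    spread = np.linalg.det(s11)                                   # eq. 5
    retained = (1.0 - np.trace(s22_1) / np.trace(sigma)) * 100.0  # eq. 7
    return spread, retained
```

An exhaustive search would simply call such a function for every one of the $\binom{n}{q}$ index sets, which is exactly the combinatorial cost that makes the approach of [4] infeasible for large feature vectors.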
Another method, proposed in [5], uses the principal components themselves. The coefficients of each principal component give an insight into the effect of each variable on that axis of the transformed space: if the $i$'th coefficient of one of the principal components is high compared to the others, the element $x_i$ of $X$ is very dominant along that axis/PC. By choosing the variables corresponding to the highest coefficients of each of the first $q$ PCs, we maintain a good approximation of the properties of the PC transformation. This method is very intuitive and does not involve heavy computation, but since it considers each PC independently, variables carrying the same information might be chosen, leaving a lot of redundant information in the obtained subset. The method has been used effectively in applications where the PC coefficients are discretized based on the highest coefficient values. An example can be seen in [8], where the authors project a feature set (optical flow of lip tracking) to a lower-dimensional space using PCA, but in fact use a simple linear combination of the original features, determined by setting the highest coefficients of the chosen principal components to 1, so that only a few of the features are actually used.
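For later comparison with our method, a minimal sketch of this largest-coefficient selection rule might look as follows (our own illustration of the idea described above, not the authors' code; the data matrix `X` has samples in rows, and ties between PCs picking the same variable are resolved by moving to the next largest coefficient):

```python
import numpy as np

def largest_coefficient_selection(X, q):
    """Pick one variable per retained PC, following the idea described in [5].

    X : (m, n) data matrix, m samples of the n-dimensional feature vector
    q : number of principal components (and hence features) to retain
    Returns the indices of the q selected features.
    """
    Xc = X - X.mean(axis=0)                    # zero-mean the data
    cov = np.cov(Xc, rowvar=False)             # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort descending
    A_q = eigvecs[:, order[:q]]                # n x q, first q PCs

    selected = []
    for j in range(q):
        # For the j'th PC, take the variable with the largest |coefficient|
        # that has not been chosen yet.
        for i in np.argsort(-np.abs(A_q[:, j])):
            if i not in selected:
                selected.append(int(i))
                break
    return selected
```

Because each PC is inspected in isolation, two highly correlated variables can both be selected when they dominate different PCs, which is exactly the redundancy discussed above.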
Another method, proposed in [6] and [7], chooses a subset of size q by computing its PC projection to a smaller-dimensional space and minimizing a measure defined using Procrustes analysis [9]. This method helps reduce the redundancy of information, but again involves heavy computation, since many combinations of subsets are explored.

In our method we also exploit the information carried by the PC coefficients to obtain the optimal subset of features, but unlike the method in [5], we use all of the PCs together to gain a better insight into the structure of the original features, so that we can choose variables without redundancy of information. The next section describes this method.

3. Principal Feature Analysis

3.1 Motivation

To explain the motivation behind the analysis, let us observe a very simple example. Let $X = [x_1\; x_2\; x_3\; x_4]^T$ be a zero-mean, normally distributed random feature vector. The variables are constructed in the following manner: (a) $x_1, x_2$ are statistically independent; (b) $x_3 = x_1 + n_1$; (c) $x_4 = x_1 + x_2 + n_2$, where $n_1, n_2$ are zero-mean, normally distributed random variables with much lower variance than $x_1$ and $x_2$.

Suppose we want to choose a subset of variables from this feature vector. Knowing the statistics, it is evident that choosing $x_2$ and either $x_1$ or $x_3$ gives a very good representation of the entire data, since $x_1$ and $x_3$ are very highly correlated and $x_4$ is a mixture of information from $x_2$ and the set $\{x_1, x_3\}$. This information can be extracted from the principal components of the data. To see this, we compute the principal components and observe that the first two PCs retain 99.8% of the variability in the data. Let

$A_2 = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ a_{41} & a_{42} \end{bmatrix}$

be the matrix whose columns are the first two principal components, corresponding to the two largest eigenvalues of the covariance matrix.

The coefficients of the row vectors of this matrix can be viewed as the weights of a single feature on the lower-dimensional space, i.e., if $a_{ij}$ is high, the $i$'th feature has a strong effect on the projection of the data onto the $j$'th lower-dimensional axis. If we compare the structure of these row vectors, we see that they are very similar for features which are highly correlated and very different for features which are not correlated. Figure 1 shows a plot of the four row vectors of this example. As can be seen from the figure, the row vectors corresponding to $x_1$ and $x_3$ are very near each other, the vector corresponding to $x_2$ is far from them, and the vector corresponding to $x_4$ is somewhere in between.

Knowing that we have chosen a two-dimensional subspace, if we cluster these vectors into two clusters (say $\{x_2, x_4\}$ and $\{x_1, x_3\}$), we can choose one variable from each cluster using some metric to decide on the best feature. If we choose in each cluster the vector that is farthest from the other cluster, the natural choice is $x_2$ and $x_1$. This combination retains 98% of the variability in the data, computed using equation (7). Choosing $x_4$, however, does not yield a good representation of the data (90% of the variability): although $x_4$ holds information on all of the variables, it has no discriminating power in the subspace, and choosing it together with any of the others results in a lot of redundant information.

This simple example shows that the structure of the row vectors corresponding to the first $q$ PCs carries information on the dependencies between the features in the lower-dimensional space. The number of features chosen should be at least the dimension of the subspace $q$, and it has been shown that a slightly higher number of features should be taken in order to achieve the same performance as the $q$-dimensional PCA [4,5].

Figure 1: Plot of the PC coefficients corresponding to the four variables (x-axis: first PC; y-axis: second PC).
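The example can be reproduced numerically in a few lines. The sketch below is our own illustration: the variances of $x_1$, $x_2$ and of the noise terms are arbitrary choices consistent with the description above, so the printed percentages will be close to, but not necessarily identical with, the figures quoted in the text. It generates the four variables, computes the first two PCs, and prints the (absolute-value) row vectors that are plotted in Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000                                   # number of samples

x1 = rng.normal(0.0, 1.0, m)                # independent variables
x2 = rng.normal(0.0, 1.0, m)
x3 = x1 + rng.normal(0.0, 0.1, m)           # x3 ~ x1 plus low-variance noise
x4 = x1 + x2 + rng.normal(0.0, 0.1, m)      # x4 mixes x1 and x2
X = np.column_stack([x1, x2, x3, x4])

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variability retained by first 2 PCs: %.1f%%"
      % (100 * eigvals[:2].sum() / eigvals.sum()))

A2 = eigvecs[:, :2]                         # 4 x 2 matrix of the example
for i, row in enumerate(np.abs(A2), start=1):
    print("x%d row vector: %s" % (i, np.round(row, 3)))
```

Clustering the four row vectors of $|A_2|$ into two groups (for instance with K-Means, as in the algorithm below) typically groups $x_1$ with $x_3$ and places $x_2$ and $x_4$ in the other cluster, matching the picture in Figure 1.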
3.2 Principal Feature Analysis: The Algorithm

Let $X$ be a zero-mean $n$-dimensional random feature vector and let $\Sigma$ be its covariance matrix (which can be in correlation form as well). Let $A$ be the matrix whose columns are the orthonormal eigenvectors of $\Sigma$, computed from the decomposition of $\Sigma$:

$\Sigma = A \Lambda A^T$,   (8)

where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ with $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$, and $A^T A = I_n$. Let $A_q$ be the first $q$ columns of $A$ and let $V_1, V_2, \ldots, V_n \in \mathbb{R}^q$ be the rows of the matrix $A_q$. The vector $V_i$ corresponds to the $i$'th feature (variable) in the vector $X$, and the coefficients of $V_i$ are the weights of that feature on each axis of the subspace. Features that are highly correlated will have similar absolute-value weight vectors (the absolute value is necessary since changing the sign of one variable changes the signs of the corresponding weights but has no statistical significance [5]). In order to find the best subset, we use the structure of these rows to first find the features which are highly correlated with each other, and then choose from each group of correlated features the one which represents that group optimally in terms of high spread in the lower dimension, reconstruction, and insensitivity to noise. The algorithm can be summarized in the following five steps:

Step 1: Compute the sample covariance matrix, or use the true covariance matrix if it is available. In some cases it is preferable to use the correlation matrix instead of the covariance matrix. The correlation matrix is the $n \times n$ matrix whose $(i,j)$'th entry is

$\rho_{ij} = \frac{E[x_i x_j]}{\sqrt{E[x_i^2]\, E[x_j^2]}}$.   (9)

This representation is preferred when the features have very different variances from each other, since using the regular covariance form would cause PCA to put very heavy weights on the features with the highest variances. See [5] for more details.

Step 2: Compute the principal components and eigenvalues of the covariance/correlation matrix as defined in equation (8).

Step 3: Choose the subspace dimension $q$ and construct the matrix $A_q$ from $A$. The dimension can be chosen by deciding how much of the variability of the data should be retained, computed as

$\text{Retained Variability} = \frac{\sum_{i=1}^{q} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \cdot 100\%$.   (10)

Step 4: Cluster the vectors $|V_1|, |V_2|, \ldots, |V_n|$ into $p \geq q$ clusters using the K-Means algorithm [15] with the Euclidean distance measure. This is an iterative stage which repeats until the $p$ clusters are found and no longer change. The reason for choosing $p$ greater than $q$ in some cases is that, if the same retained variability as the PCA is desired, a slightly higher number of features is needed (usually 1-5 more are enough).

Step 5: In each cluster, find the vector $V_i$ which is closest to the mean of the cluster and choose the corresponding feature $x_i$ as a principal feature. This yields a choice of $p$ features.

The reason for choosing the vector nearest to the mean is twofold. This feature can be thought of as the central feature of the cluster, the one most dominant in it, and the one which holds the least redundant information about features in other clusters. Thus it satisfies both of the properties we wanted to achieve: a large "spread" in the lower-dimensional space and a good representation of the original data. Note that if a cluster has just two members, the chosen feature is the one with the higher variance of its coefficients, since this means it has better separating power in the subspace.

By choosing the principal features with this algorithm, we choose a subset that represents the entire feature set well, both in terms of retaining the variation in the feature space (through the clustering procedure) and in terms of keeping the prediction error at a minimum (by choosing in each cluster the feature whose vector is closest to the cluster mean). The complexity of the algorithm is of the order of performing PCA, since the K-Means algorithm is applied to just $n$ vectors and normally converges after very few iterations. This is somewhat more expensive than the method proposed in [5], but it takes into account the information given by the entire set of chosen principal components and therefore captures the interdependencies of the features better. The method does not optimize the criteria given in [4], but it approximates them and can be applied to large sets of features, a task which is infeasible for the method in [4]. An important byproduct of this method is the ability to predict the original features with a known accuracy, using their LS estimate

$\hat{X} = \begin{bmatrix} I_q \\ \Sigma_{21} \Sigma_{11}^{-1} \end{bmatrix} \tilde{X}$,   (11)

where $\tilde{X}$ is the chosen subset of features.
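A compact sketch of these five steps (our own illustration, not the authors' code; it uses numpy and scikit-learn's KMeans, takes a data matrix `X` with samples in rows, and chooses $q$ so that the requested fraction of variability is retained) could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_feature_analysis(X, retained=0.90, extra=0, use_correlation=False):
    """Select principal features following the five-step algorithm above.

    X        : (m, n) data matrix, m samples of n features
    retained : fraction of variability to retain when choosing q (Step 3)
    extra    : number of additional clusters beyond q (p = q + extra, Step 4)
    Returns the indices of the selected principal features.
    """
    # Step 1: covariance (or correlation) matrix of the zero-meaned data
    Xc = X - X.mean(axis=0)
    sigma = np.corrcoef(Xc, rowvar=False) if use_correlation \
        else np.cov(Xc, rowvar=False)

    # Step 2: eigen-decomposition, eigenvalues sorted in decreasing order
    eigvals, eigvecs = np.linalg.eigh(sigma)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: pick q from the retained-variability criterion (eq. 10)
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    q = int(np.searchsorted(ratio, retained) + 1)
    A_q = eigvecs[:, :q]

    # Step 4: cluster the absolute row vectors |V_1|,...,|V_n| into p clusters
    p = q + extra
    V = np.abs(A_q)                          # n vectors of dimension q
    km = KMeans(n_clusters=p, n_init=10, random_state=0).fit(V)

    # Step 5: in each cluster, keep the feature closest to the cluster mean
    selected = []
    for c in range(p):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(V[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected)
```

The special rule for two-member clusters (taking the feature with the higher coefficient variance) is omitted here for brevity; once the features are selected, equation (11) can be used to predict the discarded features from the measured ones.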
4. Experiments

To show the results of the Principal Feature Analysis, we apply the method to two examples of possible applications.

4.1 Facial Motion Feature Point Selection

The first application is the selection of the important points that should be tracked on a human face in order to account for its non-rigid motion. This is a classic example of the need for feature selection, since it can be very expensive, and perhaps impossible, to track many points on the face reliably. Methods using optical flow [11],[8] or patches [12] could benefit from selecting fewer, more meaningful features, thus decreasing the error and the time of the tracking. In addition, the remaining features can be predicted using the LS estimate, which can be used to drive an animated figure from the tracking results.

Since the tracking of the points in the development stage has to be accurate, we used markers located on many facial points. Tracking of these markers is done automatically using template matching techniques, and the results are checked manually for error correction. We thus have reliable tracking results for the entire video sequence (60 seconds at 30 frames per second) of human facial motion performing several normal actions: smiling, frowning, acting surprised, and talking. We estimate the 2-D non-rigid facial motion vector of each feature point over the entire sequence after accounting for the global head motion using stationary points (the nose tip). The images in Figure 2 show some of the facial expressions that appear in the video sequence. In order to avoid singularity of the covariance matrix, time periods with no motion at all are not taken into account.

A total of 40 facial points are tracked. For the Principal Feature Analysis, the points are split into two groups: upper face (eyes and above) and lower face. Each point is represented by its horizontal and vertical directions, so the actual number of features is 80. We compute the correlation matrix of each group after subtracting its mean and apply the Principal Feature Analysis to choose the important features (points and their directions of motion) while retaining 90% of the variability in the data.

Figure 3 shows the results of the analysis. The chosen features are marked by arrows displaying the principal direction chosen for each feature point. It can be seen that the chosen features correspond to physically based models of the face: vertical motion was chosen for the middle lip point, with more features chosen around the lips than in other lower facial regions. This implies that much of the lower-face motion can be inferred mainly from lip motion tracking (an easier task from the practical point of view). For the upper part of the face, points on the eyebrows were chosen, mostly in the vertical direction (with the inner eyebrows also representing the horizontal motion that appeared in the original sequence), which is in agreement with the physical models. It can also be seen that fewer points are needed for tracking the upper part of the face (7 principal motion points) than the lower part (9 principal motion points), since there are fewer degrees of freedom in the motion of the upper part of the face. This analysis is comparable to the classic facial motion studies of Ekman [16], in which facial movements displaying emotions were mapped to Action Units corresponding to facial muscle movements.

This example shows that Principal Feature Analysis can model a difficult physical phenomenon such as facial motion and reduce the complexity of existing algorithms by removing the need to measure all of the features. This is in contrast to PCA, which needs measurements of the entire original motion vector to do the same modeling.
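Driving a full set of facial points from the few selected ones amounts to applying the LS prediction of equation (11). A minimal sketch under our own assumptions (the covariance matrix `sigma` was estimated over the full 80-dimensional motion vectors during development, `selected` holds the indices returned by the feature analysis, and `x_meas` holds the measured, zero-mean motions of only those features) might be:

```python
import numpy as np

def predict_full_motion(sigma, selected, x_meas):
    """Predict the full motion vector from the selected principal features
    using the LS estimate of equation (11).

    sigma    : (n, n) covariance of the full motion feature vector
    selected : indices of the measured (principal) features
    x_meas   : measured values of the selected features (zero-mean)
    Returns the predicted n-dimensional motion vector.
    """
    n = sigma.shape[0]
    rest = [i for i in range(n) if i not in selected]

    s11 = sigma[np.ix_(selected, selected)]
    s21 = sigma[np.ix_(rest, selected)]

    x_hat = np.empty(n)
    x_hat[selected] = x_meas                          # measured features pass through (I_q block)
    x_hat[rest] = s21 @ np.linalg.inv(s11) @ x_meas   # regression of the rest on the measured ones
    return x_hat
```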
4.2 Dimensionality Reduction of Faces

The second experiment shows the performance of Principal Feature Analysis for reducing the dimensionality of a frontal face image. We also compare face recognition results using Principal Feature Analysis, PCA, and the method described in [5]. Face recognition schemes have used PCA to reduce the dimensionality of the feature vector, which is simply the original image of a person's face put into vector form by concatenating its rows [10]. Another space reduction method that has been used is the Fisher Linear Discriminant [13]. These methods were adopted because discrimination in the high-dimensional space suffers from the 'curse of dimensionality' and is computationally very expensive.

We used a set of 93 facial images of different people of size 33x32, taken from the Purdue face database [14]. There were 51 male and 42 female subjects, with and without glasses, with and without beards, taken under the same strict lighting conditions and with a neutral facial expression. To make the problem somewhat more realistic, the images were not perfectly aligned, as can be seen in the examples shown in Figure 4(a).

The result of the Principal Feature Analysis can be seen in Figure 4(1), which shows the chosen pixels highlighted on one of the faces from the database. There were 56 pixels chosen (out of 1056), accounting for 90% of the variability. The same number of principal components retains 95% of the variability; this 5% loss is slightly higher than expected and is caused by the imperfect alignment of the data. Figure 4(b) shows the reconstruction of some of the images using those 56 pixels and the prediction transformation given by equation (11). It can be seen that the reconstruction appears accurate, and even subjects with glasses (10 out of the entire set) or beards (2 out of the set) were reconstructed well. Figure 4(c) shows the reconstruction using the method in [5], where there is visible loss of detail, especially around the mouths of the subjects. Table 1 shows the mean square error measured over the entire set for the three methods. As expected, the principal feature method outperforms the method in [5]; it is comparable to (although higher than) PCA using 56 PCs, and exactly the same as the error using 45 PCs. As can be seen in Figure 4, this difference is not apparent in the reconstructed images.

As a second part of the experiment, we used the Principal Feature Analysis to perform recognition of the faces in the database. As the training set we used a second set of faces of the same 93 people, in which the faces were smiling, and we used a nearest-neighbor classifier. As can be seen in Table 1, the recognition rate using Principal Feature Analysis is the same as the recognition rate using PCA with 56 PCs (94%) and slightly higher than PCA using 45 PCs (92%). In contrast, the method of [5] performed poorly on the test set (78%). Checking the faces where the errors were made showed that Principal Feature Analysis made its errors on the same faces as PCA and gave the same wrong classification for all but one of these errors. This strengthens the claim that the method approximates Principal Component Analysis well. The advantage of using Principal Feature Analysis over PCA in this case is that, although all of the features are measured anyway, the transformation to the lower-dimensional space involves no computation at all, a fact which can speed up face recognition without loss of accuracy.
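The recognition pipeline itself is straightforward; the sketch below (our own illustration of the comparison, assuming row-vectorized images and either a pixel index set from the feature analysis or a PC matrix as input) highlights why the Principal Feature variant needs no projection computation at all:

```python
import numpy as np

def nearest_neighbor_recognition(train, test, selected=None, A_q=None):
    """Nearest-neighbor face recognition in a reduced space.

    train, test : (m, n) matrices of row-vectorized face images
    selected    : indices of principal-feature pixels (PFA case), or
    A_q         : (n, q) matrix of principal components (PCA case)
    Returns, for each test face, the index of its nearest training face.
    """
    if selected is not None:
        # PFA: the "projection" is just picking pixels - no computation needed
        tr, te = train[:, selected], test[:, selected]
    else:
        # PCA: projection requires a matrix multiplication per image
        tr, te = train @ A_q, test @ A_q

    # Nearest neighbor under the Euclidean distance
    dists = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```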
5. Summary

In this paper we have proposed a novel method for dimensionality reduction, Principal Feature Analysis. By exploiting the structure of the principal components of a set of features, we can choose the principal features which retain most of the information, both in the sense of maximum variability of the features in the lower-dimensional space and in the sense of minimizing the reconstruction error. We have shown that this method can give results equivalent to PCA, or with a comparably small difference. The method should be used in cases where the cost of using all of the features can be very high, whether in terms of computational cost, measurement cost, or reliability of the measurements. Its most natural use is in the development stage, where it can pick a small feature vector out of a large set of candidate features, as shown in the first experiment. It can also be used to reconstruct the original large set of features, which in the first experiment could serve to drive an animated face, a task which normally needs many facial points as input. But, as shown in the second example, it can also be used in a broader span of applications, such as the dimensionality reduction used for appearance-based face recognition.

6. References

[1] Lisboa, P.J.G., Mehri-Dehnavi, R., "Sensitivity Methods for Variable Selection Using the MLP", International Workshop on Neural Networks for Identification, Control, Robotics and Signal/Image, 1996, pp. 330-338.
[2] Lin, T.-S., Meador, J., "Statistical Feature Extraction and Selection for IC Test Pattern Analysis", Circuits and Systems, 1992, vol. 1, pp. 391-394.
[3] Hocking, R.R., "Developments in Linear Regression Methodology: 1959-1982", Technometrics, 1983, vol. 25, pp. 219-249.
[4] McCabe, G.P., "Principal Variables", Technometrics, 1984, vol. 26, pp. 137-144.
[5] Jolliffe, I.T., Principal Component Analysis, Springer-Verlag, New York, 1986.
[6] Krzanowski, W.J., "Selection of Variables to Preserve Multivariate Data Structure, Using Principal Component Analysis", Applied Statistics - Journal of the Royal Statistical Society Series C, 1987, vol. 36, pp. 22-33.
[7] Krzanowski, W.J., "A Stopping Rule for Structure-Preserving Variable Selection", Statistics and Computing, March 1996, vol. 6, pp. 51-56.
[8] Mase, K., Pentland, A., "Automatic Lipreading by Optical-Flow Analysis", Systems & Computers in Japan, 1991, vol. 22, no. 6, pp. 67-76.
[9] Gower, J.C., "Statistical Methods of Comparing Different Multivariate Analyses of the Same Data", in Mathematics in the Archaeological and Historical Sciences (F.R. Hodson, D.G. Kendall and P. Tautu, editors), University Press, Edinburgh, 1971, pp. 138-149.
[10] Turk, M.A., Pentland, A., "Face Recognition Using Eigenfaces", Proceedings of CVPR, 1991, pp. 586-591.
[11] DeCarlo, D., Metaxas, D., "Combining Information Using Hard Constraints", Proceedings of CVPR, 1999, vol. 2, pp. 132-138.
[12] Tao, H., Huang, T.S., "Bezier Volume Deformation Model for Facial Animation and Video Tracking", International Workshop on Motion Capture Techniques for Virtual Environments, 1998, pp. 242-253.
[13] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J., "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection", IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1997, vol. 19, no. 7, pp. 711-720.
[14] Martinez, A., Benavente, R., "The AR Face Database", CVC Technical Report #24, June 1998.
[15] Arabie, P., Hubert, L.J., De Soete, G. (editors), Clustering and Classification, World Scientific, River Edge, NJ, 1998.
[16] Ekman, P., Emotion in the Human Face, Cambridge University Press, New York, 1982.

Figure 2: Examples of facial expressions.

Figure 3: Principal motion features chosen for the facial points. The arrows show the chosen motion direction (horizontal or vertical).

Figure 4: (1) Principal pixels shown highlighted on one of the faces. (a) Original face images. (b) Reconstructed images using Principal Feature Analysis. (c) Reconstructed images using the largest coefficient method [5]. (d) Reconstructed images using the first 56 principal components.

Table 1: Comparison of the methods.

Method                                     | Mean Square Error | Variability Retained | Recognition Rate
Principal Feature Analysis (56 pixels)     | 15                | 90%                  | 94%
Largest Coefficient Method [5] (56 pixels) | 18                | 85%                  | 78%
PCA using 56 PCs                           | 11                | 95%                  | 94%
PCA using 45 PCs                           | 15                | 90%                  | 92%