Principal Component Analysis

Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02

Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis (PCA)
• Nonlinear PCA using Kernels

What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
  – The criterion for feature reduction depends on the problem setting.
    • Unsupervised setting: minimize the information loss.
    • Supervised setting: maximize the class discrimination.
• Given a set of data points $x_1, x_2, \ldots, x_n$ of $p$ variables, compute the linear transformation (projection) $G \in \mathbb{R}^{p \times d}$ that maps $x \in \mathbb{R}^p$ to $y = G^T x \in \mathbb{R}^d$, with $d \ll p$.
• In matrix form, the original data $X \in \mathbb{R}^{p \times n}$ is mapped by the linear transformation to the reduced data $Y = G^T X \in \mathbb{R}^{d \times n}$.

High-dimensional data
• Gene expression
• Face images
• Handwritten digits

Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data.
  – Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases.
• The intrinsic dimension may be small.
  – For example, the number of genes responsible for a certain type of disease may be small.
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.

Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification

Feature reduction algorithms
• Unsupervised
  – Latent Semantic Indexing (LSI): truncated SVD
  – Independent Component Analysis (ICA)
  – Principal Component Analysis (PCA)
  – Canonical Correlation Analysis (CCA)
• Supervised
  – Linear Discriminant Analysis (LDA)
• Semi-supervised
  – Still an active research topic

What is Principal Component Analysis?
• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
  – Retains most of the sample's information.
  – Useful for the compression and classification of data.
• By information we mean the variation present in the sample, as given by the correlations between the original variables.
• The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

Geometric picture of principal components (PCs)
• The 1st PC $z_1$ is a minimum-distance fit to a line in the $X$ space.
• The 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
• The PCs are a series of linear least-squares fits to the sample, each orthogonal to all the previous ones (see the sketch below).
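A minimal NumPy sketch of this geometric picture (the toy data and the variable names are illustrative assumptions, not part of the slides; data points are stored as columns of $X$, following the slides' convention): the leading eigenvector of the sample covariance gives the line with the smallest sum of squared perpendicular distances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D sample: columns are data points, following the slides' X in R^{p x n}.
n = 500
x = rng.normal(size=n)
X = np.vstack([x, 0.5 * x + 0.3 * rng.normal(size=n)])   # p = 2, strongly correlated
X = X - X.mean(axis=1, keepdims=True)                    # center the sample

# First PC: leading eigenvector of the sample covariance S = (1/n) X X^T.
S = X @ X.T / n
eigvals, eigvecs = np.linalg.eigh(S)                     # eigenvalues in ascending order
a1 = eigvecs[:, -1]                                      # direction of largest variance

def perp_sse(u, X):
    """Sum of squared perpendicular distances from the points to the line span{u}."""
    proj = u @ X                                         # scalar projections onto u
    return np.sum(np.sum(X**2, axis=0) - proj**2)

# The first PC is the minimum-distance line fit: no random unit direction beats it.
best_random = min(perp_sse(v / np.linalg.norm(v), X)
                  for v in rng.normal(size=(1000, 2)))
assert perp_sse(a1, X) <= best_random
print("residual along 1st PC:", perp_sse(a1, X))
```

For centered data, minimizing the perpendicular residuals and maximizing the variance of the projection are the same problem, which is why the same direction answers both descriptions and leads into the algebraic definition that follows.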
Algebraic definition of PCs
Given a sample of $n$ observations $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$ on a vector of $p$ variables, define the first principal component of the sample by the linear transformation
$$z_1 = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, n,$$
where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$, $x_j = (x_{1j}, x_{2j}, \ldots, x_{pj})^T$, and $a_1$ is chosen such that $\mathrm{var}[z_1]$ is maximized.

Algebraic derivation of PCs
To find $a_1$, first note that
$$\mathrm{var}[z_1] = E\big[(z_1 - \bar{z}_1)^2\big] = \frac{1}{n} \sum_{i=1}^{n} \big(a_1^T x_i - a_1^T \bar{x}\big)^2 = \frac{1}{n} \sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1 = a_1^T S a_1,$$
where
$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$
is the covariance matrix and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean. In the following we assume the data is centered, i.e., $\bar{x} = 0$.

Assuming $\bar{x} = 0$, form the matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$; then
$$S = \frac{1}{n} X X^T.$$
The eigenvectors of $S$ can be obtained by computing the SVD of $X$: $X = U \Sigma V^T$, whose left singular vectors (the columns of $U$) are the eigenvectors of $S$.

To find the $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let $\lambda$ be a Lagrange multiplier and maximize
$$L = a_1^T S a_1 - \lambda\,(a_1^T a_1 - 1).$$
Setting $\partial L / \partial a_1 = 0$ gives $S a_1 - \lambda a_1 = (S - \lambda I_p)\, a_1 = 0$; therefore $a_1$ is an eigenvector of $S$, and since $\mathrm{var}[z_1] = a_1^T S a_1 = \lambda$, it must be the eigenvector corresponding to the largest eigenvalue $\lambda_1$.

To find the next coefficient vector $a_2$, maximize $\mathrm{var}[z_2]$ subject to $z_2$ being uncorrelated with $z_1$, i.e., $\mathrm{cov}[z_2, z_1] = 0$, and $a_2^T a_2 = 1$. First note that
$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1\, a_1^T a_2,$$
then let $\lambda$ and $\phi$ be Lagrange multipliers and maximize
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
Setting $\partial L / \partial a_2 = S a_2 - \lambda a_2 - \phi a_1 = 0$ and premultiplying by $a_1^T$ (using $a_1^T a_2 = 0$, which follows from the uncorrelatedness constraint, and $a_1^T a_1 = 1$) gives $\phi = 0$, so
$$S a_2 = \lambda a_2 \quad \text{and} \quad \lambda = a_2^T S a_2.$$
We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda_2$ is the second largest. In general,
$$\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k.$$
• The $k$th largest eigenvalue of $S$ is the variance of the $k$th PC.
• The $k$th PC $z_k$ retains the $k$th greatest fraction of the variation in the sample.

• Main steps for computing PCs
  – Form the covariance matrix $S$.
  – Compute its eigenvectors $\{a_i\}_{i=1}^{p}$.
  – Use the first $d$ eigenvectors $\{a_i\}_{i=1}^{d}$ to form the $d$ PCs.
  – The transformation $G$ is given by $G = [a_1, a_2, \ldots, a_d]$; a test point $x \in \mathbb{R}^p$ is mapped to $G^T x \in \mathbb{R}^d$.

Optimality property of PCA
Dimension reduction maps the original data $X \in \mathbb{R}^{p \times n}$ to $Y = G^T X \in \mathbb{R}^{d \times n}$; reconstruction maps $Y$ back to $\hat{X} = G\,(G^T X) \in \mathbb{R}^{p \times n}$.
Main theoretical result: the matrix $G$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves
$$\min_{G \in \mathbb{R}^{p \times d},\; G^T G = I_d} \; \big\| X - G\,(G^T X) \big\|_F^2,$$
where $\| X - \hat{X} \|_F^2$ is the reconstruction error. That is, the PCA projection minimizes the reconstruction error among all linear projections of size $d$ (a numerical sketch follows below).
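The main steps and the optimality property can be checked with a short NumPy sketch (the toy dimensions, the synthetic low-rank data, and the variable names are assumptions made for illustration; columns of $X$ are the data points, as in the slides).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with low intrinsic dimension: columns of X are the data points (X in R^{p x n}).
p, n, d = 10, 200, 3
X = rng.normal(size=(p, 3)) @ rng.normal(size=(3, n)) + 0.05 * rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)              # center, so the mean is zero

# Main steps: form the covariance matrix, compute its eigenvectors, keep the first d.
S = X @ X.T / n                                    # S = (1/n) X X^T
eigvals, eigvecs = np.linalg.eigh(S)               # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                  # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
G = eigvecs[:, :d]                                 # G = [a_1, ..., a_d], with G^T G = I_d

Y = G.T @ X                                        # reduced data, d x n
X_hat = G @ Y                                      # reconstruction G (G^T X)

# Optimality property: the reconstruction error equals n times the sum of the
# discarded eigenvalues of S.
err = np.linalg.norm(X - X_hat, 'fro') ** 2
assert np.isclose(err, n * eigvals[d:].sum())
print("reconstruction error:", err)
print("fraction of variation retained:", eigvals[:d].sum() / eigvals.sum())

# The eigenvectors of S are the left singular vectors of X (X = U Sigma V^T):
# both routes span the same d-dimensional principal subspace.
U, _, _ = np.linalg.svd(X, full_matrices=False)
assert np.allclose(U[:, :d] @ U[:, :d].T, G @ G.T)
```

The final assertion checks the slides' remark that the eigenvectors of $S$ can equivalently be obtained from the SVD of $X$.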
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland, 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo, 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien, 2003.

PCA for image compression
Reconstructions of an image from its first d principal components, for d = 1, 2, 4, 8, 16, 32, 64, and 100, shown alongside the original image.

Motivation
Linear projections will not detect the pattern in nonlinearly structured data.

Nonlinear PCA using Kernels
• Traditional PCA applies a linear transformation, which may not be effective for nonlinear data.
• Solution: apply a nonlinear transformation $\Phi: x \mapsto \Phi(x)$ into a potentially very high-dimensional feature space.
• Computational efficiency: apply the kernel trick.
  – This requires that PCA can be rewritten in terms of dot products: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$.
  – More on kernels later.

Rewrite PCA in terms of dot products. Assume the data has been centered, i.e., $\sum_{i=1}^{n} x_i = 0$. The covariance matrix $S$ can then be written as
$$S = \frac{1}{n} \sum_i x_i x_i^T.$$
Let $v$ be an eigenvector of $S$ corresponding to a nonzero eigenvalue $\lambda$:
$$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v \quad \Longrightarrow \quad v = \frac{1}{n \lambda} \sum_i (x_i^T v)\, x_i.$$
Hence the eigenvectors of $S$ lie in the space spanned by the data points.

In matrix form, $S = \frac{1}{n} X X^T$ with $X = [x_1, x_2, \ldots, x_n]$, and $v = \sum_i \alpha_i x_i = X \alpha$. Then
$$S v = \frac{1}{n} X X^T X \alpha = \lambda X \alpha \quad \Longrightarrow \quad \frac{1}{n} (X^T X)(X^T X)\, \alpha = \lambda\, (X^T X)\, \alpha \quad \Longrightarrow \quad \frac{1}{n} (X^T X)\, \alpha = \lambda \alpha.$$
Any benefits? The eigenproblem now involves only the matrix of dot products $X^T X$.

Next consider the feature space $\Phi: x \mapsto \Phi(x)$:
$$S^{\Phi} = \frac{1}{n} X^{\Phi} (X^{\Phi})^T, \quad \text{where } X^{\Phi} = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)], \qquad v = \sum_i \alpha_i \Phi(x_i) = X^{\Phi} \alpha.$$
Apply the kernel trick: the $(i, j)$-th entry of $(X^{\Phi})^T X^{\Phi}$ is $\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j)$, where $K$ is called the kernel matrix. The eigenproblem becomes
$$\frac{1}{n} K \alpha = \lambda \alpha.$$

• Projection of a test point $x$ onto $v$:
$$\Phi(x) \cdot v = \Phi(x) \cdot \sum_i \alpha_i \Phi(x_i) = \sum_i \alpha_i\, \Phi(x) \cdot \Phi(x_i) = \sum_i \alpha_i\, K(x, x_i).$$
• The explicit mapping $\Phi$ is not required here (see the sketch after the references).

References
• Principal Component Analysis. I. T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf et al.
• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.
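To make the kernel PCA procedure concrete, here is a minimal NumPy sketch using the polynomial kernel $k(x, y) = (x^T y)^2$, chosen only because its explicit feature map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ is known, so the kernel-based computation can be verified against ordinary PCA carried out in the feature space. The kernel choice, the toy data, the centering of the kernel matrix, and all variable names are assumptions of this sketch rather than part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: columns of X are the points, as in the slides (X in R^{p x n}, here p = 2).
p, n, d = 2, 50, 2
X = rng.normal(size=(p, n))

# Kernel k(x, y) = (x.y)^2; its explicit feature map Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
# is known, so the result can be checked directly.
K = (X.T @ X) ** 2                                  # n x n kernel matrix

# Center the kernel matrix (equivalent to centering the mapped data in feature space).
one_n = np.ones((n, n)) / n
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Solve (1/n) K alpha = lambda alpha and keep the d leading eigenvectors.
lam, alphas = np.linalg.eigh(K_c / n)               # ascending eigenvalues
lam, alphas = lam[::-1][:d], alphas[:, ::-1][:, :d]

# Normalize so each feature-space eigenvector v = sum_i alpha_i Phi(x_i) has unit norm:
# v.v = alpha^T K_c alpha = n*lambda for a unit-norm alpha, so rescale by 1/sqrt(n*lambda).
alphas = alphas / np.sqrt(n * lam)

# Projections of the training points onto v: sum_i alpha_i k_c(x_j, x_i) -- no Phi needed.
Z_kernel = K_c @ alphas                             # n x d

# Check against ordinary PCA carried out explicitly in the feature space.
Phi = np.column_stack([X[0] ** 2, np.sqrt(2) * X[0] * X[1], X[1] ** 2])  # one row per point
Phi_c = Phi - Phi.mean(axis=0)
_, U = np.linalg.eigh(Phi_c.T @ Phi_c / n)
Z_explicit = Phi_c @ U[:, ::-1][:, :d]              # projections onto leading eigenvectors

assert np.allclose(np.abs(Z_kernel), np.abs(Z_explicit))   # identical up to sign
print("kernel PCA matches explicit feature-space PCA")
```

The check passes because double-centering the kernel matrix corresponds exactly to centering the mapped data in feature space, and the rescaling of $\alpha$ by $1/\sqrt{n\lambda}$ makes each feature-space eigenvector $v = \sum_i \alpha_i \Phi(x_i)$ unit length, so both routes compute the same projections.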