Principal Component Analysis
Jing Gao, SUNY Buffalo

Why Dimensionality Reduction?
• We have too many dimensions
  – To reason about or obtain insights from
  – To visualize
  – Too much noise in the data
  – Need to "reduce" them to a smaller set of factors
  – Better representation of the data without losing much information
  – Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition

Component Analysis
• Discover a new set of factors/dimensions/axes against which to represent, describe, or evaluate the data
• Factors are combinations of observed variables
  – May be more effective bases for insights
  – Observed data are described in terms of these factors rather than in terms of the original variables/dimensions

Basic Concept
• Areas of variance in the data are where items can be best discriminated and key underlying phenomena observed
  – Areas of greatest "signal" in the data
• If two items or dimensions are highly correlated or dependent
  – They are likely to represent highly related phenomena
  – If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable
• So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance
• We want a smaller set of variables that explain most of the variance in the original data, in a more compact and insightful form
• These variables are called "factors" or "principal components"

Principal Component Analysis
• Most common form of factor analysis
• The new variables/dimensions
  – Are linear combinations of the original ones
  – Are uncorrelated with one another (orthogonal in dimension space)
  – Capture as much of the original variance in the data as possible
  – Are called Principal Components

What are the new axes?
[Figure: data plotted against Original Variable A and Original Variable B, with the new axes PC 1 and PC 2 drawn through the point cloud]
• Orthogonal directions of greatest variance in the data
• Projections along PC 1 discriminate the data most along any one axis

Principal Components
• The first principal component is the direction of greatest variability (covariance) in the data
• The second is the next orthogonal (uncorrelated) direction of greatest variability
  – So first remove all the variability along the first component, and then find the next direction of greatest variability
• And so on …

Principal Components Analysis (PCA)
• Principle
  – Linear projection method to reduce the number of parameters
  – Transforms a set of correlated variables into a new set of uncorrelated variables
  – Maps the data into a space of lower dimensionality
• Properties
  – Can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables
  – The new axes are orthogonal and represent the directions of maximum variability

Algebraic definition of PCs
Given a sample of n observations on a vector of p variables, x_1, x_2, \ldots, x_n \in \mathbb{R}^p, define the first principal component of the sample by the linear transformation

    z_1 = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \quad j = 1, 2, \ldots, n,

where a_1 = (a_{11}, a_{21}, \ldots, a_{p1}) and x_j = (x_{1j}, x_{2j}, \ldots, x_{pj}), and a_1 is chosen such that \mathrm{var}[z_1] is maximum.

Algebraic derivation of PCs
To find a_1, first note that

    \mathrm{var}[z_1] = E\big((z_1 - \bar{z}_1)^2\big)
                      = \frac{1}{n} \sum_{i=1}^{n} \big(a_1^T x_i - a_1^T \bar{x}\big)^2
                      = \frac{1}{n} \sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1
                      = a_1^T S a_1,

where

    S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T   is the covariance matrix, and
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i   is the mean.

In the following, we assume the data is centered, i.e., \bar{x} = 0.
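The relation above can be checked numerically. Below is a minimal NumPy sketch (not from the slides; the random data and variable names are illustrative): it centers the data, forms S = (1/n) X X^T, takes the eigenvector with the largest eigenvalue as a_1, and verifies that var[z_1] equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))          # p = 5 variables, n = 200 observations (one per column)

x_bar = X.mean(axis=1, keepdims=True)  # sample mean vector
Xc = X - x_bar                         # center the data so that the mean is zero
S = (Xc @ Xc.T) / X.shape[1]           # covariance matrix S = (1/n) X X^T

eigvals, eigvecs = np.linalg.eigh(S)   # S is symmetric; eigh returns eigenvalues in ascending order
a1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
lambda1 = eigvals[-1]

z1 = a1 @ Xc                           # first PC scores: z_1 = a_1^T x_j for each observation
print(np.allclose(z1.var(), lambda1))  # var[z_1] = a_1^T S a_1 = lambda_1 -> True
```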
Algebraic derivation of PCs (continued)
Assume \bar{x} = 0 and form the matrix

    X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},

then

    S = \frac{1}{n} X X^T.

To find the a_1 that maximizes \mathrm{var}[z_1] subject to a_1^T a_1 = 1, let \lambda be a Lagrange multiplier and set

    L = a_1^T S a_1 - \lambda (a_1^T a_1 - 1),
    \frac{\partial L}{\partial a_1} = S a_1 - \lambda a_1 = 0
    \;\Rightarrow\; S a_1 = \lambda a_1, \quad a_1^T S a_1 = \lambda.

Therefore a_1 is an eigenvector of S corresponding to the largest eigenvalue \lambda = \lambda_1.

To find the next coefficient vector a_2, maximize \mathrm{var}[z_2] subject to \mathrm{cov}[z_2, z_1] = 0 (uncorrelated) and a_2^T a_2 = 1. Noting that

    \mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,

let \lambda and \phi be Lagrange multipliers, and maximize

    L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1.

We find that a_2 is also an eigenvector of S, whose eigenvalue \lambda = \lambda_2 is the second largest. In general,

    \mathrm{var}[z_k] = a_k^T S a_k = \lambda_k.

• The kth largest eigenvalue of S is the variance of the kth PC.
• The kth PC z_k retains the kth greatest fraction of the variation in the sample.

• Main steps for computing PCs
  – Form the covariance matrix S.
  – Compute its eigenvectors \{a_i\}_{i=1}^{p}.
  – Use the first d eigenvectors \{a_i\}_{i=1}^{d} to form the d PCs.
  – The transformation G is given by G = [a_1, a_2, \ldots, a_d]; a test point x \in \mathbb{R}^p is mapped to G^T x \in \mathbb{R}^d.

Dimensionality Reduction
The linear transformation G \in \mathbb{R}^{p \times d} (so G^T \in \mathbb{R}^{d \times p}) maps the original data x \in \mathbb{R}^p to the reduced data y = G^T x \in \mathbb{R}^d.

Steps of PCA
• Let \bar{X} be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X' = X - \bar{X}
• Compute the covariance matrix S of the adjusted data X'
• Find the eigenvectors and eigenvalues of S

Principal components – Variance
[Figure: bar chart of the variance (%) explained by PC1 through PC10]

Transformed Data
• Eigenvalue \lambda_j corresponds to the variance on component j
• Thus, sort the eigenvalues \lambda_j in decreasing order
• Take the first d eigenvectors a_i, where d is the number of top eigenvalues kept
• These are the directions with the largest variances

    \begin{bmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{id} \end{bmatrix}
    = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_d^T \end{bmatrix}
      \left( \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}
           - \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} \right)

An Example
Mean1 = 24.1, Mean2 = 53.8

    X1    X2    X1'     X2'
    19    63    -5.1     9.25
    39    74    14.9    20.25
    30    87     5.9    33.25
    30    23     5.9   -30.75
    15    35    -9.1   -18.75
    15    43    -9.1   -10.75
    15    32    -9.1   -21.75
    30    73     5.9    19.25

[Figure: scatter plots of the original data (X2 vs. X1) and of the mean-adjusted data (X2' vs. X1')]

Covariance Matrix
• C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}
• We find its eigenvectors and eigenvalues:
  – a_1 = (0.21, -0.98), \lambda_1 = 560.2
  – a_2 = (-0.98, -0.21), \lambda_2 = 51.8

Transform to One Dimension
• We keep the dimension of a_1 = (0.21, -0.98)
• We obtain the final data as

    y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21\, x_{i1} - 0.98\, x_{i2},

giving y_i = -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63 for the eight mean-adjusted points.

[Figure: the eight transformed points plotted along a single axis]
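As a companion to the example, here is a short Python sketch (not from the slides; the pca helper and its name are assumptions for illustration) that applies the main steps above — mean-adjust the data, form the covariance matrix with the 1/n factor, eigendecompose it, sort by eigenvalue, and project — to the eight (X1, X2) observations, reducing them to one dimension.

```python
import numpy as np

def pca(X, d):
    """X: p x n data matrix (one observation per column). Returns (G, Y),
    where G is p x d (the top-d eigenvectors of the covariance matrix as
    columns) and Y = G^T (X - mean) is the d x n reduced data."""
    x_bar = X.mean(axis=1, keepdims=True)   # mean vector
    Xc = X - x_bar                          # mean-adjusted data X' = X - X_bar
    S = (Xc @ Xc.T) / X.shape[1]            # covariance matrix S = (1/n) X' X'^T
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort by eigenvalue, largest first
    G = eigvecs[:, order[:d]]               # keep the top-d directions
    Y = G.T @ Xc                            # project onto the principal components
    return G, Y

# The eight observations from the example, as a 2 x 8 matrix (rows: X1, X2).
X = np.array([[19, 39, 30, 30, 15, 15, 15, 30],
              [63, 74, 87, 23, 35, 43, 32, 73]], dtype=float)

G, Y = pca(X, d=1)   # reduce to one dimension
print(G.ravel())     # first principal direction a_1 (its sign is arbitrary)
print(Y.ravel())     # the eight points along the single transformed dimension
```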