Principal Components – I

• When a very large number p of variables is measured on each sample unit, interpreting the results of analyses can be difficult.

• It is often possible to reduce the dimensionality of the data by finding a smaller set of k linear combinations of the original p variables that preserve most of the variability across sample units.

• These linear combinations are called principal components.

• Finding the principal components (or PCs) is often an early step in a more complex analysis. Scores for sample PCs can be used as response variables to fit MANOVA or regression models, to cluster sample units, or to build classification rules.

• Principal components can also be used in diagnostic investigations to find outliers or high-leverage cases.

• Geometrically, the PCs form a new coordinate system obtained by rotating the original system that has X1, X2, ..., Xp as coordinate axes.

• The new axes represent the directions of maximum variability and permit a more parsimonious description of the covariance matrix (essentially by ignoring coordinates with little variability).

• PCs are derived from the eigenvectors of a covariance matrix Σ or of the correlation matrix ρ, and do not require the assumption of multivariate normality.

• If multivariate normality holds, however, some distributional properties of the PCs can be established.

Population Principal Components

• Let the random vector X = (X1, X2, ..., Xp)′ have covariance matrix Σ with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0.

• Consider the p linear combinations
  $Y_1 = a_1' X = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p$
  $Y_2 = a_2' X = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p$
  $\vdots$
  $Y_p = a_p' X = a_{p1} X_1 + a_{p2} X_2 + \cdots + a_{pp} X_p$

• We have that
  $\mathrm{Var}(Y_i) = a_i' \Sigma a_i, \quad i = 1, 2, \ldots, p$
  $\mathrm{Cov}(Y_i, Y_k) = a_i' \Sigma a_k, \quad i, k = 1, 2, \ldots, p.$

• Principal components are the uncorrelated linear combinations Y1, Y2, ..., Yp determined sequentially, as follows:
  – The first PC is the linear combination $Y_1 = a_1' X$ that maximizes $\mathrm{Var}(Y_1) = a_1' \Sigma a_1$ subject to $a_1' a_1 = 1$.
  – The second PC is the linear combination $Y_2 = a_2' X$ that maximizes $\mathrm{Var}(Y_2) = a_2' \Sigma a_2$ subject to $a_2' a_2 = 1$ and $\mathrm{Cov}(Y_1, Y_2) = a_1' \Sigma a_2 = 0$.
  – The ith PC is the linear combination $Y_i = a_i' X$ that maximizes $\mathrm{Var}(Y_i) = a_i' \Sigma a_i$ subject to $a_i' a_i = 1$ and $\mathrm{Cov}(Y_i, Y_k) = a_i' \Sigma a_k = 0$ for $k < i$.

• Let Σ have eigenvalue-eigenvector pairs (λi, ei) with λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. Then the ith principal component is given by
  $Y_i = e_i' X = e_{i1} X_1 + e_{i2} X_2 + \cdots + e_{ip} X_p, \quad i = 1, \ldots, p.$

• Then
  $\mathrm{Var}(Y_i) = e_i' \Sigma e_i = e_i' \lambda_i e_i = \lambda_i$, since $e_i' e_i = 1$,
  $\mathrm{Cov}(Y_i, Y_k) = e_i' \Sigma e_k = e_i' \lambda_k e_k = 0$, since $e_i' e_k = 0$.

• Proof: see Section 8.2 in the textbook. The proof uses the maximization of quadratic forms studied in Chapter 2, with B = Σ.

• Using properties of the trace of Σ, we have
  $\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_i \mathrm{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_i \mathrm{Var}(Y_i).$

• Proof: We know that $\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \mathrm{tr}(\Sigma)$, and since Σ is positive definite we can decompose $\Sigma = P \Lambda P'$, where Λ is the diagonal matrix of eigenvalues and P is orthogonal with columns equal to the eigenvectors of Σ. Then
  $\mathrm{tr}(\Sigma) = \mathrm{tr}(P \Lambda P') = \mathrm{tr}(\Lambda P' P) = \mathrm{tr}(\Lambda) = \lambda_1 + \lambda_2 + \cdots + \lambda_p.$
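• These results are easy to verify numerically. Below is a minimal sketch in Python with NumPy (an illustrative choice; the covariance matrix shown is hypothetical and not taken from the slides) that extracts the PCs of a small Σ and checks that the PC variances equal the eigenvalues, that the PCs are uncorrelated, and that the total variance tr(Σ) is preserved.

```python
import numpy as np

# A small hypothetical covariance matrix (p = 3), for illustration only.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# Eigendecomposition of the symmetric matrix Sigma.
# eigh returns eigenvalues in increasing order, so reverse to get lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
lam = eigvals[::-1]                 # ordered eigenvalues
E = eigvecs[:, ::-1]                # columns e_1, ..., e_p

# Covariance matrix of Y = E'X is E' Sigma E, which should be diag(lambda_1, ..., lambda_p).
pc_cov = E.T @ Sigma @ E
print(np.allclose(np.diag(pc_cov), lam))         # True: Var(Y_i) = lambda_i
print(np.allclose(pc_cov, np.diag(lam)))         # True: Cov(Y_i, Y_k) = 0 for i != k
print(np.isclose(lam.sum(), np.trace(Sigma)))    # True: total variance is preserved
```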
• Since the total population variance, $\sum_i \sigma_{ii}$, is equal to the sum of the variances of the principal components, $\sum_i \lambda_i$, we say that
  $\dfrac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$
  is the proportion of the total variance associated with (or explained by) the kth principal component.

• If a large proportion of the total population variance (say 80% or 90%) is explained by the first k PCs, then we can ignore the original p variables and restrict attention to the first k PCs without much loss of information.

• The correlation between the ith PC and the kth original variable,
  $\rho_{Y_i, X_k} = \dfrac{e_{ik} \sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}},$
  is a measure of the contribution of the kth variable to the ith principal component.

• An alternative is to consider the coefficients $e_{ik}$ to decide whether variable Xk contributes substantially to Yi or not.

• Ranking the p original variables Xk with respect to their contribution to Yi using either the correlations or the coefficients $e_{ik}$ leads to similar conclusions when the Xk have similar variances.

PCs from Multivariate Normal Populations

• Suppose that X ∼ Np(µ, Σ). The density of X is constant on ellipsoids centered at µ that are determined by
  $(x - \mu)' \Sigma^{-1} (x - \mu) = c^2,$
  with axes $\pm c \sqrt{\lambda_i}\, e_i$, i = 1, ..., p, where (λi, ei) are the eigenvalue-eigenvector pairs of Σ.

• Assume for simplicity that µ = 0, so that
  $c^2 = x' \Sigma^{-1} x = \frac{1}{\lambda_1}(e_1' x)^2 + \frac{1}{\lambda_2}(e_2' x)^2 + \cdots + \frac{1}{\lambda_p}(e_p' x)^2,$
  using the spectral decomposition of $\Sigma^{-1}$. Notice that $e_i' x$ is the ith PC of x.

• If we let $y_i = e_i' x$, we can write
  $c^2 = \frac{1}{\lambda_1} y_1^2 + \frac{1}{\lambda_2} y_2^2 + \cdots + \frac{1}{\lambda_p} y_p^2,$
  which also defines an ellipsoid, now in the coordinate system with axes y1, y2, ..., yp lying in the directions of the eigenvectors.

• Thus, y1, y2, ..., yp are parallel to the directions of the axes of a constant-density ellipsoid.

• Any point on the ith ellipsoid axis has x coordinates proportional to $e_i'$ and principal component coordinates of the form [0, 0, ..., yi, ..., 0].

• If µ ≠ 0, then the mean-centered PC $y_i = e_i'(x - \mu)$ has mean 0 and lies in the direction of $e_i$.

[Figure: Two examples of PCs from MVN data]

PCs from Standardized Variables

• When variables are measured on different scales, it is useful to standardize them before extracting the PCs. If
  $Z_i = \dfrac{X_i - \mu_i}{\sqrt{\sigma_{ii}}},$
  then $Z = (V^{1/2})^{-1}(X - \mu)$ is the p-dimensional vector of standardized variables, where the matrix $V^{1/2}$ is diagonal with elements $\sqrt{\sigma_{ii}}$.

• Note that Cov(Z) = Corr(X), the correlation matrix of the original variables X.

• If (λi, ei) now denote the eigenvalue-eigenvector pairs of Corr(X), the PCs of Z are given by
  $Y_i = e_i' Z = e_i' (V^{1/2})^{-1} (X - \mu), \quad i = 1, \ldots, p.$

• Proceeding as before,
  $\sum_{i=1}^p \mathrm{Var}(Y_i) = \sum_{i=1}^p \mathrm{Var}(Z_i) = p, \qquad \rho_{Y_i, Z_k} = e_{ik}\sqrt{\lambda_i}, \quad i, k = 1, \ldots, p.$
  Further, the proportion of the standardized population variance due to the kth principal component is
  $\dfrac{\lambda_k}{p}, \quad k = 1, \ldots, p,$
  where the λi are the eigenvalues of ρ.

• In general, the PCs extracted from Σ = Cov(X) and from Corr(X) will not be the same.

• The two sets of PCs are not even functions of one another, so standardizing the variables has important consequences.

• When should one standardize before computing PCs?

• If variables are measured on very different scales (e.g., patient weights in kg varying from 40 to 100, and protein concentrations in ppm varying between 1 and 10), then the variables with the larger variances will dominate.

• When one variable has a much larger variance than any of the other variables, we will end up with a single PC that is essentially proportional to the dominating variable. In some cases this may be desirable, but in many cases it is not, and it then makes sense to work with the correlation matrix, as the sketch below illustrates.
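• The following sketch (again Python/NumPy; the simulated weight and protein values are hypothetical, chosen only to mimic the scale difference mentioned above) contrasts the PCs of a covariance matrix with those of the corresponding correlation matrix when one variable has a much larger variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: weight in kg (large variance) and protein in ppm (small variance).
n = 200
weight = rng.normal(70.0, 15.0, n)                    # variance around 225
protein = 5.0 + 0.02 * weight + rng.normal(0.0, 1.0, n)  # variance around 1
X = np.column_stack([weight, protein])

# PCs from the covariance matrix S: the leading eigenvector is dominated by weight.
S = np.cov(X, rowvar=False)
eval_S, evec_S = np.linalg.eigh(S)
print("first PC from S:", evec_S[:, -1])   # essentially all weight (up to sign)

# PCs from the correlation matrix R: both standardized variables contribute.
R = np.corrcoef(X, rowvar=False)
eval_R, evec_R = np.linalg.eigh(R)
print("first PC from R:", evec_R[:, -1])   # coefficients of comparable magnitude
```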
PCs from Uncorrelated Variables

• If the variables X are uncorrelated, then Σ is diagonal with elements σii.

• In this case, if the σii are in decreasing order, the eigenvalues are λi = σii, and one choice for the corresponding eigenvectors is $e_i = [0 \;\cdots\; 0 \; 1 \; 0 \;\cdots\; 0]'$, with the 1 in the ith position.

• Since $e_i' X = X_i$, the PCs are just the original variables. Thus, we gain nothing by extracting the PCs when the Xi are uncorrelated.

• If the uncorrelated variables are standardized, the resulting PCs are just the original standardized variables Zi.

• The argument above changes somewhat if the variances in Σ are not in decreasing order: in that case, the ei together form the columns of an appropriate permutation matrix.

Sample Principal Components

• If x1, x2, ..., xn is a sample of p-dimensional vectors from a distribution with mean µ and covariance Σ, then the sample mean vector, sample covariance matrix, and sample correlation matrix are x̄, S, and R, respectively.

• The eigenvalue-eigenvector pairs of S are denoted $(\hat\lambda_i, \hat e_i)$, and the ith sample PC is given by
  $\hat y_i = \hat e_i' x = \hat e_{i1} x_1 + \hat e_{i2} x_2 + \cdots + \hat e_{ip} x_p, \quad i = 1, \ldots, p.$

• The sample variance of the ith PC is $\hat\lambda_i$, and the sample covariance between $(\hat y_i, \hat y_k)$ is zero for all i ≠ k.

• The total sample variance $s_{11} + s_{22} + \cdots + s_{pp}$ is equal to $\hat\lambda_1 + \cdots + \hat\lambda_p$, and the relative contribution of the kth variable to the ith sample PC is given by $r_{\hat y_i, x_k}$.

• Observations are sometimes centered by subtracting the mean: $(x_j - \bar x)$.

• The PCs in this case are $\hat y_i = \hat e_i'(x - \bar x)$, where again $\hat e_i$ is the ith eigenvector of S.

• If $\hat y_{ji}$ is the score on the ith sample PC for the jth sample unit, then the sample mean of the scores for the ith sample PC (across sample units) is zero:
  $\bar{\hat y}_i = \frac{1}{n} \sum_{j=1}^n \hat e_i'(x_j - \bar x) = \hat e_i' \frac{1}{n} \sum_{j=1}^n (x_j - \bar x) = \hat e_i' 0 = 0.$

• The sample variance of the ith sample PC is $\hat\lambda_i$, the ith largest eigenvalue of S.

PCs from the MLE of Σ

• If the data are normally distributed, $\hat\Sigma = \frac{1}{n} \sum_{j=1}^n (x_j - \bar x)(x_j - \bar x)'$ is the MLE of Σ, and for $(\hat\delta_i, \hat e_i)$ the eigenvalue-eigenvector pairs of $\hat\Sigma$, the PCs
  $\hat y_{ji} = \hat e_i' x_j, \quad i = 1, \ldots, p, \; j = 1, \ldots, n,$
  are the MLEs of the population principal components.

• Also, the eigenvalues of $\hat\Sigma$ are
  $\hat\delta_i = \dfrac{n-1}{n} \hat\lambda_i,$
  where the $\hat\lambda_i$ are the eigenvalues of S, and the eigenvectors are the same. Thus both S and $\hat\Sigma$ give the same PCs, with the same proportion of total variance explained by each.
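• A short numerical check of these sample results, again as a hedged Python/NumPy sketch (the simulated data and sample sizes are hypothetical): the eigenvalues of $\hat\Sigma$ equal $(n-1)/n$ times those of S, the eigenvectors coincide up to sign, the sample PC scores have mean zero, and their sample variances are the $\hat\lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical sample of n = 50 observations on p = 4 correlated variables.
n, p = 50, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

Xc = X - X.mean(axis=0)            # mean-centered observations (x_j - x_bar)
S = Xc.T @ Xc / (n - 1)            # sample covariance matrix S
Sigma_hat = Xc.T @ Xc / n          # MLE of Sigma under normality

lam_S, E_S = np.linalg.eigh(S)         # eigenvalues/eigenvectors of S (ascending)
delta, E_mle = np.linalg.eigh(Sigma_hat)

# delta_i = (n-1)/n * lambda_hat_i, and the eigenvectors coincide up to sign.
print(np.allclose(delta, (n - 1) / n * lam_S))
print(np.allclose(np.abs(E_S), np.abs(E_mle)))

# Sample PC scores y_hat_ji = e_hat_i'(x_j - x_bar): mean zero, variance lambda_hat_i.
scores = Xc @ E_S[:, ::-1]             # columns ordered by decreasing eigenvalue
print(np.allclose(scores.mean(axis=0), 0))
print(np.allclose(scores.var(axis=0, ddof=1), lam_S[::-1]))
```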