Principal Components – I

• When a very large number p of variables is measured on each
sample unit, interpreting the results of analyses can be difficult.
• It is often possible to reduce the dimensionality of the data
by finding a smaller set of k linear combinations of the original
p variables that preserve most of the variability across sample
units.
• These linear combinations are called principal components.
• Finding the principal components (or PCs) is often an early
step in a complex analysis. Scores for sample PCs can be
used as response variables to fit MANOVA or regression models, to cluster sample units or to build classification rules.
• Principal components can also be used in diagnostic investigations to find outliers or high leverage cases.
Principal Components - I
• Geometrically, PCs are a new coordinate system obtained by
rotating the original system which has X1, X2, ..., Xp as the
coordinate axes.
• The new set of axes represents the directions of maximum
variability and permits a more parsimonious description of the
covariance matrix (essentially by ignoring coordinates with
little variability).
• PCs are derived from the eigenvectors of a covariance matrix
Σ or from the correlation matrix ρ, and do not require the
assumption of multivariate normality.
• If multivariate normality holds, however, some distributional
properties of PCs can be established.
Population Principal Components
• Let the random vector X = (X1, X2, . . . , Xp)′ have covariance
matrix Σ with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0.
• Consider p linear combinations
Y1 = a1′X = a11X1 + a12X2 + · · · + a1pXp
Y2 = a2′X = a21X1 + a22X2 + · · · + a2pXp
...
Yp = ap′X = ap1X1 + ap2X2 + · · · + appXp
• We have that
Var(Yi) = ai′Σai,  i = 1, 2, ..., p
Cov(Yi, Yk) = ai′Σak,  i, k = 1, 2, ..., p
(see the numeric sketch below).
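To make these formulas concrete, here is a minimal numpy sketch; the covariance matrix Sigma and the coefficient vectors a1, a2 are hypothetical values chosen only for illustration, not numbers from the slides.

```python
# Hedged sketch: variance and covariance of linear combinations Yi = ai'X,
# computed as the quadratic forms ai' Sigma ai and ai' Sigma ak.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],      # hypothetical covariance matrix of X
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

a1 = np.array([1.0, 0.0, 0.0])          # Y1 = X1
a2 = np.array([0.5, 0.5, 0.0])          # Y2 = (X1 + X2) / 2

var_Y1 = a1 @ Sigma @ a1                # a1' Sigma a1 = 4.0
var_Y2 = a2 @ Sigma @ a2                # a2' Sigma a2 = 2.75
cov_Y1_Y2 = a1 @ Sigma @ a2             # a1' Sigma a2 = 3.0
print(var_Y1, var_Y2, cov_Y1_Y2)
```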
Population Principal Components
• Principal Components are uncorrelated linear combinations
Y1, Y2, ..., Yp determined sequentially, as follows:
– The first PC is the linear combination Y1 = a1′X that
maximizes Var(Y1) = a1′Σa1 subject to a1′a1 = 1.
– The second PC is the linear combination Y2 = a2′X that
maximizes Var(Y2) = a2′Σa2 subject to a2′a2 = 1 and
Cov(Y1, Y2) = a1′Σa2 = 0.
...
– The ith PC is the linear combination Yi = ai′X that
maximizes Var(Yi) = ai′Σai subject to ai′ai = 1 and
Cov(Yi, Yk) = ai′Σak = 0 for k < i.
Population Principal Components
• Let Σ have eigenvalue-eigenvector pairs (λi, ei) with
λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. Then, the ith principal component is
given by
Yi = ei′X = ei1X1 + ei2X2 + · · · + eipXp,  i = 1, ..., p.
• Then,
Var(Yi) = ei′Σei = ei′λiei = λi, since ei′ei = 1
Cov(Yi, Yk) = ei′Σek = ei′λkek = 0, since ei′ek = 0 for i ≠ k.
• Proof: See Section 8.2 in the textbook. The proof uses the
maximization of quadratic forms studied in Chapter 2, with
B = Σ. A numeric illustration appears below.
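The following sketch (same kind of made-up Σ as before) checks this numerically: the matrix of the PCs' variances and covariances, E′ΣE, is diagonal with the eigenvalues on the diagonal.

```python
# Hedged sketch: PCs defined by the eigenvectors of Sigma are uncorrelated
# and have variances equal to the eigenvalues.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])      # hypothetical covariance matrix

lam, E = np.linalg.eigh(Sigma)           # eigh returns eigenvalues in ascending order
lam, E = lam[::-1], E[:, ::-1]           # reorder so lambda_1 >= ... >= lambda_p

cov_Y = E.T @ Sigma @ E                  # covariance matrix of Y = E'X
print(np.round(cov_Y, 10))               # diagonal: diag(lambda_1, ..., lambda_p)
print(lam)
```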
Population Principal Components
• Using properties of the trace of Σ we have
σ11 + σ22 + · · · + σpp = ∑i Var(Xi) = λ1 + λ2 + · · · + λp = ∑i Var(Yi).
• Proof: We know that σ11 + σ22 + · · · + σpp = tr(Σ), and we
also know that since Σ is symmetric (and positive semidefinite),
we can decompose Σ = PΛP′, where Λ is the diagonal matrix of
eigenvalues and P is orthogonal with columns equal to the
eigenvectors of Σ. Then
tr(Σ) = tr(PΛP′) = tr(ΛP′P) = tr(Λ) = λ1 + λ2 + · · · + λp.
Population Principal Components
• Since the total population variance, ∑i σii, is equal to the
sum of the variances of the principal components, ∑i λi, we
say that
λk / (λ1 + λ2 + ... + λp)
is the proportion of the total variance associated with (or
explained by) the kth principal component.
• If a large proportion of the total population variance (say
80% or 90%) is explained by the first k PCs, then we can
ignore the original p variables and restrict attention to the
first k PCs without much loss of information; the sketch
below illustrates the computation.
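A brief numeric sketch of both facts, again with an assumed Σ: the eigenvalues sum to tr(Σ), and each λk divided by that total gives the proportion of variance explained.

```python
# Hedged sketch: total variance and proportion of variance explained by each PC.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])                 # hypothetical covariance matrix

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]      # eigenvalues, largest first
print(np.isclose(lam.sum(), np.trace(Sigma)))       # True: total variance is preserved

prop = lam / lam.sum()                              # proportion explained by each PC
print(np.round(prop, 3), np.round(np.cumsum(prop), 3))
```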
Population Principal Components
• The correlation between the ith PC and the kth original
variable,
ρYi,Xk = eik √λi / √σkk,
is a measure of the contribution of the kth variable to the
ith principal component.
• An alternative is to consider the coefficients eik to decide
whether variable Xk contributes greatly or not to Yi.
• Ranking the p original variables Xk with respect to their
contribution to Yi using the correlations or the coefficients
eik results in similar conclusions when the Xk have similar
variances.
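A short sketch of the correlation formula with the same hypothetical Σ; rho[i, k] below is the correlation between the ith PC and the kth original variable.

```python
# Hedged sketch: rho_{Yi,Xk} = e_ik * sqrt(lambda_i) / sqrt(sigma_kk).
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])      # hypothetical covariance matrix

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]           # largest eigenvalue first; column i of E is e_i

sd_X = np.sqrt(np.diag(Sigma))           # sqrt(sigma_kk)
rho = (E * np.sqrt(lam)).T / sd_X        # rho[i, k] = E[k, i] * sqrt(lam[i]) / sd_X[k]
print(np.round(rho, 3))                  # row i: correlations of PC i with X1, ..., Xp
```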
PCs from Multivariate Normal Populations
• Suppose that X ∼ Np(µ, Σ). The density of X is constant on
ellipsoids centered at µ that are determined by
(x − µ)′Σ⁻¹(x − µ) = c²,
with axes ±c√λi ei, i = 1, ..., p, where (λi, ei) are the
eigenvalue-eigenvector pairs of Σ.
• Assume for simplicity that µ = 0, so that
c² = x′Σ⁻¹x = (1/λ1)(e1′x)² + (1/λ2)(e2′x)² + · · · + (1/λp)(ep′x)²,
using the spectral decomposition of Σ⁻¹. Notice that ei′x is the
ith PC of x.
PCs from Multivariate Normal Populations
• If we let yi = ei′x we can write
c² = (1/λ1)y1² + (1/λ2)y2² + · · · + (1/λp)yp²,
which also defines an ellipsoid in a coordinate system with
axes y1, y2, ..., yp lying in the directions of the eigenvectors.
• Thus, the y1, y2, ..., yp axes are parallel to the axes of a
constant-density ellipsoid.
• Any point on the ith ellipsoid axis has x coordinates
proportional to ei and has principal component coordinates
of the form [0, 0, ..., yi, ..., 0].
• If µ ≠ 0 then the mean-centered PC yi = ei′(x − µ) has mean
0 and lies in the direction of ei.
(The ellipsoid identity above is verified numerically in the sketch below.)
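The identity c² = x′Σ⁻¹x = ∑i yi²/λi is easy to verify numerically; the point x and the matrix Σ below are arbitrary assumed values, with µ = 0.

```python
# Hedged sketch: the quadratic form x' Sigma^{-1} x equals the sum of the squared
# PC coordinates divided by the eigenvalues.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])      # hypothetical covariance matrix
lam, E = np.linalg.eigh(Sigma)

x = np.array([1.0, -2.0, 0.5])           # arbitrary point (mu = 0 assumed)
y = E.T @ x                              # principal component coordinates y_i = e_i' x

lhs = x @ np.linalg.solve(Sigma, x)      # x' Sigma^{-1} x
rhs = np.sum(y**2 / lam)                 # sum_i y_i^2 / lambda_i
print(np.isclose(lhs, rhs))              # True
```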
Two examples of PCs from MVN data
PCs from Standardized Variables
• When variables are measured on different scales it is useful to
standardize the variables before extracting the PCs. If
Zi = (Xi − µi) / √σii,
then Z = (V^{1/2})⁻¹(X − µ) is the p-dimensional vector of
standardized variables. The matrix V^{1/2} is diagonal with
√σii as its diagonal elements.
• Note that Cov(Z) = Corr(X), the correlation matrix of the
original variables X.
• If now (λi, ei) denote the eigenvalue-eigenvector pairs of Corr(X),
the PCs of Z are given by
Yi = ei′Z = ei′(V^{1/2})⁻¹(X − µ),  i = 1, ..., p.
PCs from Standardized Variables
• Proceeding as before,
Var(Y1) + · · · + Var(Yp) = Var(Z1) + · · · + Var(Zp) = p
and
ρYi,Zk = eik √λi,  i, k = 1, ..., p.
• Further, the proportion of the standardized population variance
due to the kth principal component is
λk / p,  k = 1, ..., p,
where the λi are the eigenvalues of ρ = Corr(X) (see the sketch
below).
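A sketch of these results under the same assumed Σ: standardizing amounts to working with the correlation matrix, whose eigenvalues sum to p.

```python
# Hedged sketch: PCs of standardized variables come from Corr(X); eigenvalues sum
# to p and rho_{Yi,Zk} = e_ik * sqrt(lambda_i).
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])                    # hypothetical covariance matrix
p = Sigma.shape[0]

V_half_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))    # (V^{1/2})^{-1}
Rho = V_half_inv @ Sigma @ V_half_inv                  # correlation matrix of X

lam, E = np.linalg.eigh(Rho)
lam, E = lam[::-1], E[:, ::-1]

print(np.isclose(lam.sum(), p))                        # eigenvalues of rho sum to p
print(np.round(lam / p, 3))                            # proportion explained by each PC
print(np.round((E * np.sqrt(lam)).T, 3))               # rho_{Yi,Zk} = e_ik * sqrt(lambda_i)
```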
PCs from Standardized Variables
• In general, the PCs extracted from Σ = Cov(X) and from
Corr(X) will not be the same.
• The two sets of PCs are not simple functions of one another,
so standardizing the variables has important consequences.
• When should one standardize before computing PCs?
PCs from Standardized Variables
• If variables are measured on very different scales (e.g., patient
weight in kg varying from 40 to 100, protein concentration in
ppm varying between 1 and 10), then the variables with the
larger variances will dominate.
• When one variable has a much larger variance than any of
the other variables, we will end up with a single PC that is
essentially proportional to the dominating variable. In some
cases this might be desirable, but when it is not, it makes
sense to extract the PCs from the correlation matrix instead,
as illustrated in the sketch below.
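The sketch below uses made-up numbers loosely matching the weight/protein example to show the dominance effect and how the correlation matrix removes it.

```python
# Hedged sketch: with very unequal variances, the first PC of Sigma is essentially
# the high-variance variable, while the first PC of the correlation matrix is not.
import numpy as np

# hypothetical 2x2 covariance: Var(weight) = 225, Var(protein) = 4, correlation 0.5
Sigma = np.array([[225.0, 15.0],
                  [ 15.0,  4.0]])
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
Rho = D_inv @ Sigma @ D_inv

lam_S, E_S = np.linalg.eigh(Sigma)       # ascending order; last column = first PC
lam_R, E_R = np.linalg.eigh(Rho)

print(E_S[:, -1])     # approx (1, 0.07) up to sign: dominated by the weight variable
print(E_R[:, -1])     # approx (0.71, 0.71) up to sign: both variables weighted equally
```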
PCs from Uncorrelated Variables
• If variables X are uncorrelated, then Σ is diagonal with
elements σii.
• The eigenvalues in this case are λi = σii (if the σii are in
decreasing order) and one choice for the corresponding
eigenvectors is ei = [0 ... 0 1 0 ... 0]′, with the 1 in the ith
position.
• Since ei′X = Xi, we note that the PCs are just the original
variables. Thus, we gain nothing by trying to extract the
PCs when the Xi are uncorrelated.
• If the uncorrelated variables are standardized, the resulting
PCs are just the original standardized variables Zi.
• The above argument changes somewhat if the variances in Σ
are not in decreasing order: in that case, the ei's together form
an appropriate permutation matrix (see the sketch below).
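A quick check with an assumed diagonal Σ whose variances are deliberately not in decreasing order:

```python
# Hedged sketch: for uncorrelated variables the eigenvalues are the variances
# (sorted) and the eigenvectors are unit vectors, so the PCs are the original
# variables, possibly reordered.
import numpy as np

Sigma = np.diag([5.0, 2.0, 7.0])         # hypothetical diagonal covariance matrix
lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]           # largest first

print(lam)                               # [7. 5. 2.] -- the sorted variances
print(np.abs(E))                         # columns of |E| form a permutation matrix
```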
Sample Principal Components
• If x1, x2, ..., xn is a sample of p-dimensional vectors from a
distribution with mean µ and covariance matrix Σ, then the
sample mean vector, sample covariance matrix, and sample
correlation matrix are x̄, S, and R, respectively.
• Eigenvalue-eigenvector pairs of S are denoted (λ̂i, êi) and the
ith sample PC is given by
ŷi = êi′x = êi1x1 + êi2x2 + · · · + êipxp,  i = 1, ..., p.
• The sample variance of the ith PC is λ̂i and the sample
covariance between (ŷi, ŷk ) is zero for all i 6= k.
• The total sample variance s11 + s22 + ... + spp is equal to
λ̂1 + ... + λ̂p and the relative contribution of the kth variable
to the ith sample PC is given by rŷi,xk .
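A data-based sketch with simulated (not real) observations, showing that the sample variance of the ith score vector is λ̂i and that scores for different PCs are uncorrelated:

```python
# Hedged sketch: sample PCs from the eigendecomposition of the sample covariance S.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[4.0, 2.0, 0.5],
                                 [2.0, 3.0, 1.0],
                                 [0.5, 1.0, 2.0]],
                            size=200)                 # simulated n x p data matrix

S = np.cov(X, rowvar=False)                           # sample covariance matrix
lam_hat, E_hat = np.linalg.eigh(S)
lam_hat, E_hat = lam_hat[::-1], E_hat[:, ::-1]        # largest eigenvalue first

scores = X @ E_hat                                    # sample PC scores y_hat_ji
print(np.round(np.cov(scores, rowvar=False), 3))      # approx diag(lambda_hat)
print(np.round(lam_hat, 3))
```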
Sample Principal Components
• Observations are sometimes centered by subtracting the mean:
(xj − x̄).
• The PCs in this case are ŷi = êi′(x − x̄) where again êi is the
ith eigenvector of S.
• If ŷji is the score on the ith sample PC for the jth sample
unit, then we note that the sample mean of the scores for
the ith sample PC (across sample units) is zero:
ȳi = (1/n) ∑j=1..n êi′(xj − x̄) = êi′ (1/n) ∑j=1..n (xj − x̄) = êi′ 0 = 0.
• The sample variance of the ith sample PC is λ̂i, the ith
largest eigenvalue of S.
PCs from the MLE of Σ
• If the data are normally distributed, Σ̂ = (1/n) ∑j=1..n (xj − x̄)(xj − x̄)′
is the MLE of Σ, and for (δ̂i, êi) the eigenvalue-eigenvector
pairs of Σ̂, the PCs
ŷji = êi′xj,  i = 1, ..., p,  j = 1, ..., n,
are the MLEs of the population principal components.
• Also, the eigenvalues of Σ̂ are
δ̂i = ((n − 1)/n) λ̂i,
where the λ̂i are the eigenvalues of S, and the eigenvectors are
the same. Then, both S and Σ̂ give the same PCs with the
same proportion of total variance explained by each (checked
numerically below).
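This relationship is easy to verify on simulated data; the sketch below compares the eigenvalues of S and of the MLE Σ̂ and checks that the explained-variance proportions agree.

```python
# Hedged sketch: eigenvalues of Sigma_hat = ((n-1)/n) S are ((n-1)/n) times those
# of S, with the same eigenvectors, so the PCs and proportions explained coincide.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                          # simulated n = 50, p = 3 data
n = X.shape[0]

S = np.cov(X, rowvar=False)                           # divisor n - 1
Sigma_hat = (n - 1) / n * S                           # MLE of Sigma, divisor n

lam_S = np.sort(np.linalg.eigvalsh(S))[::-1]
delta_hat = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]

print(np.allclose(delta_hat, (n - 1) / n * lam_S))            # scaled eigenvalues
print(np.allclose(delta_hat / delta_hat.sum(),
                  lam_S / lam_S.sum()))                        # same proportions explained
```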