Econ 5166: Data Science and Social Inquiry
Instructor: 陳由常
Principal Component Analysis
This version: September 27, 2023
1 Dimension Reduction: Encoding and Decoding
Suppose that we have p variables. Denote our data as X, which is an n × p matrix, and
each observation is a p × 1 vector. In dimension reduction, we aim to find a function f (·)
$$f : \mathbb{R}^p \to \mathbb{R}^q, \qquad q < p,$$
that can “encode” an observation xi ∈ Rp into a lower-dimensional space zi = f (xi ) ∈ Rq .
Ideally, we would also like to have a function g(·)
$$g : \mathbb{R}^q \to \mathbb{R}^p,$$
that “decodes” the observation back to the original space x̂i = g(zi ). Since the encoder maps observations into a lower-dimensional space, information loss in the process
is inevitable. A good encoder/decoder pair should minimize the reconstruction error L
$$L = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|_2^2.$$
Remark 1. For a vector x = (x_1, x_2, ..., x_p) ∈ R^p, the notation ||x||_l denotes its l-norm:
$$\|x\|_l = \Big( \sum_{j=1}^{p} |x_j|^l \Big)^{1/l} = \big( |x_1|^l + |x_2|^l + \cdots + |x_p|^l \big)^{1/l}.$$
For example, when l = 2, it is also known as the L2 norm, which corresponds to the
Euclidean norm that we have been using since high school. Later in this course, we will
also use the L1 norm, denoted as ||x||1 . However, unless otherwise specified, we usually
consider the L2 norm.
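For instance, the following minimal NumPy sketch computes the l-norm and the average reconstruction error defined above (the function names are illustrative):

```python
import numpy as np

def l_norm(x, l=2):
    # ||x||_l = (sum_j |x_j|^l)^(1/l); l = 2 gives the Euclidean (L2) norm
    return np.sum(np.abs(x) ** l) ** (1.0 / l)

def reconstruction_error(X, X_hat):
    # L = (1/n) * sum_i ||x_i - x_hat_i||_2^2, where the rows of X are observations
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

x = np.array([3.0, 4.0])
print(l_norm(x, l=2))   # 5.0, the L2 norm
print(l_norm(x, l=1))   # 7.0, the L1 norm
```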
How do we find an encoder/decoder pair? Consider the following scatter plot of our data with p = 2:
[Scatter plot of the data in the (x1, x2) plane: is our data two-dimensional or one-dimensional?]
From the graph, we can see that most of our observations lie close to the 45-degree line. So, why not project our data onto that line, which is a one-dimensional space? Consider the following example.
Example 1 (“Eyeball” PCA). Suppose p = 2. From the scatter plot above, it appears that
we can reduce the dimension of our data from p = 2 to q = 1 with minimal information
loss by projecting the data onto the 45-degree line. Specifically, let
$$A = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix},$$
and consider the function
f (x) = AT x
that maps from R2 to R1 , and
g(z) = Az
that maps from R1 to R2 .
You can verify that the encoder f(·) is the function that projects observations onto the 45-degree line, and the compressed observation is given by x̂ = g(z) = g(f(x)), with the reconstruction error given by ||x − x̂||. See the figures and table below for an illustration.
Quiz 1. Can you find z ∈ R, x̂ ∈ R², and the reconstruction error ||x − x̂|| for x = (3, 4)^T?
[Figure: projection of the points (3, 4), (7, 3), and (−5, 3) onto the line x = y.]
x          A^T x     x̂ = Az        ||x − x̂||
(3, 4)     7/√2      (7/2, 7/2)    1/√2
(7, 3)     10/√2     (5, 5)        2√2
(−5, 3)    −2/√2     (−1, −1)      4√2
[Figure: the vectors x and x̂ = Az drawn from the origin (0, 0), illustrating the residual x − x̂.]
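The entries of the table can also be reproduced numerically. Here is a minimal NumPy sketch of the encoder/decoder pair from Example 1 (variable names are illustrative):

```python
import numpy as np

A = np.array([[1.0], [1.0]]) / np.sqrt(2)     # p = 2, q = 1

def f(x):
    return A.T @ x    # encoder: z = A^T x, the coordinate along the 45-degree line

def g(z):
    return A @ z      # decoder: x_hat = A z, back in R^2

for x in [np.array([3.0, 4.0]), np.array([7.0, 3.0]), np.array([-5.0, 3.0])]:
    z = f(x)
    x_hat = g(z)
    print(x, z, x_hat, np.linalg.norm(x - x_hat))   # reproduces the table row by row
```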
Example 1 essentially demonstrates how principal component analysis (PCA) operates in
practice. PCA seeks to exploit the correlation between variables to capture the predominant variation in the data. In the extreme case, when the correlation coefficient is close
to 1, two variables are nearly linearly dependent (i.e., y ≈ ax + b for some a, b ∈ R), and
they lie along a line, as illustrated in Example 1.
For p = 2 or p = 3, we have the luxury of visualizing the data to detect whether
there exists a lower-dimensional space (a line or a plane) that adequately represents our
data. However, this visual inspection, or "eyeballing," becomes infeasible for dimensions beyond p = 3. What is our alternative? To address cases where p > 3, we must generalize our approach by abstracting the ideas we applied in lower dimensions.
2 Beyond p = 3: Deriving PCA from Scratch
As illustrated in Example 1, the core concept of PCA is to identify a lower-dimensional
space onto which our observations can be projected. How can we formalize this concept
for p > 3?
The key to generalization is abstraction. For q = 1, we may characterize PCA as
“finding (1) a line to (2) project onto, with the aim of minimizing the (3) reconstruction error.” How might this statement be generalized for q > 1? Essentially, three questions
need to be addressed:
1. What is the entity onto which we are projecting when q > 1?
2. How is projection defined for q > 1?
3. How is reconstruction defined for q > 1?
What is the entity onto which we are projecting when q > 1? For q = 1, we are
looking for a line; for q = 2, we seek a plane, and so forth. Since one vector is required to
span a line, and two vectors to define a plane, in general, we need q vectors a1 , a2 , ..., aq
to construct a q-dimensional space. In PCA, our goal is to identify these vectors, termed the “principal components”, which form the lower-dimensional space. Here, a1 is the
first principal component (PC), a2 is the second PC, and so on.
Project to    Find          Need vectors (“principal components”)
q = 1         a line        a1 ∈ R^p
q = 2         a plane       a1, a2 ∈ R^p
q = 3         a 3D space    a1, a2, a3 ∈ R^p
Remark 2. Note that if the original data is p-dimensional, we have at most p PCs.
Given that the vectors a1 , a2 , . . . , aq ∈ Rp fundamentally define a lower-dimensional
subspace (of dimension q) in Rp , we impose the conditions:
$$\|a_j\|_2 = \sqrt{a_j^T a_j} = 1 \quad \text{(unit length)},$$
$$a_j^T a_k = 0 \quad \text{(orthogonality for } j \neq k\text{)}.$$
The principal components a1 , a2 , . . . , aq possess unit length as they fundamentally represent directions. Ensuring their orthogonality, as we will see below, simplifies the definition
of the projection onto the space they span.
Quiz 2. Which of the following vectors can possibly be the first two PCs for data with three variables (p = 3)?
1. a1 = (1, 2)^T , a2 = (3, 2)^T
2. a1 = (1, √2, 0)^T , a2 = (0, 0, 1)^T
3. a1 = (0.5, 0.5)^T , a2 = (−1, 1)^T
4. a1 = (1, 0, 0)^T , a2 = (0, 0, 1)^T
Now, how do we define the projection for q > 1?¹ It turns out that if we define
$$A_{p \times q} = \begin{bmatrix} a_1 & a_2 & \cdots & a_q \end{bmatrix},$$
then for an observation x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T ∈ R^p:
$$A^T x_i = \underbrace{\begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_q^T \end{bmatrix}}_{q \times p} \underbrace{x_i}_{p \times 1} = \begin{bmatrix} a_1^T x_i \\ a_2^T x_i \\ \vdots \\ a_q^T x_i \end{bmatrix} \in \mathbb{R}^q.$$
So, the j-th component of A^T x_i is simply the inner product of a_j and x_i. Since a_j has unit length, the projection of x_i onto a_j is simply
$$(a_j^T x_i)\, a_j.$$
So, the matrix multiplication AT xi encodes xi in terms of its coordinates on a1 , a2 , ..., aq .
In matrix notation, the projection of xi onto the space spanned by a1 , a2 , . . . , aq is then
given by
$$\hat{x}_i = (a_1^T x_i)\, a_1 + (a_2^T x_i)\, a_2 + \cdots + (a_q^T x_i)\, a_q = \begin{bmatrix} a_1 & a_2 & \cdots & a_q \end{bmatrix} \begin{bmatrix} a_1^T x_i \\ a_2^T x_i \\ \vdots \\ a_q^T x_i \end{bmatrix} = A A^T x_i.$$

¹ Note that, in mathematics, an understanding of the case q = 2 generalizes automatically to any finite-dimensional case, so it is fine to assume q = 2 for this discussion.
Summary: to reduce the dimension from p to q, PCA aims to find q vectors a_1, a_2, ..., a_q ∈ R^p so that one can construct a q-dimensional space onto which to project our data. Our goal is to find the optimal vectors (the principal components) such that the reconstruction error is smallest. That is, we want to solve the minimization problem:
$$\min_{a_1, a_2, \ldots, a_q \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \|x_i - A A^T x_i\|_2^2$$
$$\text{s.t. } a_j^T a_j = 1, \; j = 1, 2, \ldots, q; \qquad a_j^T a_k = 0, \; j \neq k,$$
where
$$A_{p \times q} = \begin{bmatrix} a_1 & a_2 & \cdots & a_q \end{bmatrix}$$
and A A^T x_i ∈ R^p is the projected (compressed) version of x_i ∈ R^p.
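To make the objective concrete, here is a minimal NumPy sketch that evaluates the reconstruction error for an arbitrary feasible A with orthonormal columns, using simulated data; how to find the minimizing A is the subject of the next section.

```python
import numpy as np

def reconstruction_loss(X, A):
    # (1/n) * sum_i ||x_i - A A^T x_i||_2^2, where the rows of X are the observations x_i
    X_hat = X @ A @ A.T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Simulated data: n = 500 observations of p = 3 correlated variables
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.2],
                                              [0.0, 0.0, 0.3]])

# Any feasible A must have orthonormal columns; QR of a random matrix gives one such candidate
A_candidate, _ = np.linalg.qr(rng.standard_normal((3, 2)))
print(reconstruction_loss(X, A_candidate))   # the quantity PCA minimizes over all such A
```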
Quiz 3. Suppose that p = 3. What is the matrix A for the projection onto the plane x = y?
3 Solving PCA
So far, we have spent a lot of time defining PCA. But how can we solve it, i.e., how do we find the PCs a_1, a_2, ..., a_p?
PCA attempts to summarize the variation in the random vector X with a few principal components (PCs). Below, we define the principal components and explain how
to derive them.
In fact, we can solve PCA iteratively, that is, we can first find the first PC, a1 , then a2 ,
..., and aq . So, to find the first PC, we solve
$$\min_{a_1 \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \|x_i - a_1 a_1^T x_i\|_2^2 \quad \text{s.t. } a_1^T a_1 = 1.$$
Given a_1, we can then solve for a_2 by minimizing the remaining part x_i − a_1 a_1^T x_i:
$$\min_{a_2 \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \|(x_i - a_1 a_1^T x_i) - a_2 a_2^T x_i\|_2^2 \quad \text{s.t. } a_2^T a_2 = 1, \; a_2^T a_1 = 0.$$
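Continuing in this way yields all q components. The following minimal NumPy sketch implements this iterative idea on simulated data; it anticipates the result derived later in this section that each step's optimal direction is the top eigenvector of the current residual's covariance matrix, and it assumes the columns of X have been de-meaned.

```python
import numpy as np

def pca_by_deflation(X, q):
    # X: n x p data matrix whose columns have been de-meaned (an assumption of this sketch)
    n, p = X.shape
    R = X.copy()                      # residual: variation not yet explained by earlier PCs
    components = []
    for _ in range(q):
        S = R.T @ R / n               # covariance matrix of the current residual
        eigvals, eigvecs = np.linalg.eigh(S)
        a = eigvecs[:, -1]            # unit eigenvector with the largest eigenvalue
        components.append(a)
        R = R - np.outer(R @ a, a)    # subtract (a^T x_i) a from each residual row
    return np.column_stack(components)   # p x q matrix A = [a_1 ... a_q]

# Example usage on simulated data with p = 3 variables
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3)) @ np.array([[3.0, 1.0, 0.0],
                                               [0.0, 1.0, 0.5],
                                               [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)
A = pca_by_deflation(X, q=2)
print(A.T @ A)    # approximately the 2 x 2 identity: unit length and orthogonality hold
```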
While we motivated PCA by minimizing reconstruction errors, an alternative definition of PCA is by “maximizing variances”. For example, the first PC can be found by solving the following problem:²
$$\max_{a_1 \in \mathbb{R}^p} \; \mathrm{Var}(a_1' X) \quad \text{s.t. } a_1' a_1 = 1.$$
Remark 3. The restriction a_i' a_i = 1 is necessary; otherwise Var(a_i' X) can be made arbitrarily large by multiplying the coefficient vector a_i by a constant.
The rest of the principal components can be defined iteratively. The second PC, a_2, is defined by the optimization problem:
$$\max_{a_2 \in \mathbb{R}^p} \; \mathrm{Var}(a_2' X) \quad \text{s.t. } a_2^T a_2 = 1, \; a_1^T a_2 = 0.$$
Notice that we require the second PC to be uncorrelated with the first PC. We can think of this as requiring the second PC to explain the variation that is not explained by the first PC.
² For a rigorous argument, see our textbook (Murphy 2022).
Similarly, the j-th PC, a_j, is the solution to
$$\max_{a_j \in \mathbb{R}^p} \; \mathrm{Var}(a_j' X) \quad \text{s.t. } a_j^T a_j = 1, \; a_j^T a_{j'} = 0 \text{ for } j' = 1, 2, \ldots, j-1.$$
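To see the variance-maximization view in action, here is a small simulated check (using the covariance matrix that reappears in Example 3 below): among unit-length directions a, the sample variance of a'X is largest along the top eigenvector of Σ, which anticipates the solution derived below.

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[34.0, 12.0],      # the symmetric covariance matrix used in Example 3 below
                  [12.0, 41.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)

eigvals, eigvecs = np.linalg.eigh(Sigma)
a1 = eigvecs[:, -1]                  # unit eigenvector with the largest eigenvalue (50)

print(np.var(X @ a1))                # sample variance of a1' X, roughly 50
for _ in range(3):
    a = rng.standard_normal(2)
    a /= np.linalg.norm(a)           # enforce the unit-length constraint a'a = 1
    print(np.var(X @ a))             # noticeably smaller than the first line
```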
Remark 4. Recall that
$$a_j' x = \langle a_j, x \rangle = \begin{pmatrix} a_{j1} & a_{j2} & \cdots & a_{jp} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} = a_{j1} x_1 + a_{j2} x_2 + \cdots + a_{jp} x_p,$$
so a_j' x is a scalar.
3.1 Solve PCA when Σ is Diagonal
Remark 5. Let Σ denote the covariance matrix of X. We can verify that
$$\mathrm{Var}(a^T X) = a^T \Sigma a.$$
See the following example.

Example 2. Let
$$\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \quad \text{and} \quad a = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.$$
Then, since X_1 and X_2 are uncorrelated (Σ is diagonal),
$$\mathrm{Var}(a^T X) = \mathrm{Var}\!\left( \begin{pmatrix} 1 & 2 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \right) = \mathrm{Var}(X_1 + 2X_2) = \mathrm{Var}(X_1) + 4\,\mathrm{Var}(X_2) = 1 + 4 \cdot 2 = 9.$$
Alternatively, we can calculate Var(a^T X) by
$$a^T \Sigma a = \begin{pmatrix} 1 & 2 \end{pmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 4 \end{pmatrix} = 9.$$
Indeed, we end up with the same result, as the previous remark implied.
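The identity in Remark 5 can also be checked by simulation; a minimal NumPy sketch for the numbers in Example 2:

```python
import numpy as np

rng = np.random.default_rng(2)

Sigma = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
a = np.array([1.0, 2.0])

# Draw many samples of X with covariance Sigma and compare the two computations
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
print(np.var(X @ a))     # sample variance of a^T X, roughly 9
print(a @ Sigma @ a)     # exactly 9, matching Example 2
```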
By Remark 5, the optimization problem can be written as
$$\max_{a_1 \in \mathbb{R}^p} \; a_1^T \Sigma a_1 \quad \text{s.t. } a_1^T a_1 = 1.$$
In his book “How to Solve It”, the mathematician and probabilist George Pólya famously said:
“If you can’t solve a problem, then there is an easier problem you can solve: find it.”
Let’s follow his suggestion by assuming the covariance matrix
$$\Sigma = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{bmatrix}_{p \times p},$$
where λ1 > λ2 > · · · > λp . That is, we assume Σ is now a diagonal matrix with positive,
decreasing diagonal elements.
Now we are ready to solve the optimization problem for PC_1:
$$\max_{a_1 \in \mathbb{R}^p} \; a_1^T \Sigma a_1 \quad \text{s.t. } a_1^T a_1 = 1.$$
Notice the objective function is now
$$a_1^T \Sigma a_1 = \begin{pmatrix} a_{11} & \cdots & a_{1p} \end{pmatrix} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{bmatrix} \begin{pmatrix} a_{11} \\ \vdots \\ a_{1p} \end{pmatrix} = \lambda_1 a_{11}^2 + \lambda_2 a_{12}^2 + \cdots + \lambda_p a_{1p}^2.$$
By redefining b_{1i} = a_{1i}^2 (note that each b_{1i} ≥ 0), the maximization problem becomes
$$\max_{b} \; \lambda_1 b_{11} + \cdots + \lambda_p b_{1p} \quad \text{s.t. } b_{11} + \cdots + b_{1p} = 1.$$
It is plain to see that b_{11} = 1, b_{12} = \cdots = b_{1p} = 0 attains the maximum subject to the constraint. Hence, a_{11} = ±1, a_{12} = \cdots = a_{1p} = 0 is the solution to the PC_1 problem.³
The first PC is hence
$$a_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},$$
which points in the direction with the largest variance.
Next, we solve the PC_2 problem
$$\max_{a_2 \in \mathbb{R}^p} \; \mathrm{Var}(a_2^T X) \quad \text{s.t. } a_2^T a_2 = 1, \; a_1^T a_2 = 0.$$
³ PCs are only uniquely defined up to reflection over the origin. It is not hard to see that if a* is a solution to the PCA problem, then −a* is also a solution.


The constraint a_1^T a_2 = 0 implies a_{21} = 0. So a_2 = (0, a_{22}, ..., a_{2p})^T, and the problem reduces to
$$a_2^T \Sigma a_2 = \begin{pmatrix} 0 & a_{22} & \cdots & a_{2p} \end{pmatrix} \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{bmatrix} \begin{pmatrix} 0 \\ a_{22} \\ \vdots \\ a_{2p} \end{pmatrix} = \begin{pmatrix} a_{22} & \cdots & a_{2p} \end{pmatrix} \begin{bmatrix} \lambda_2 & & \\ & \ddots & \\ & & \lambda_p \end{bmatrix} \begin{pmatrix} a_{22} \\ \vdots \\ a_{2p} \end{pmatrix}.$$
The problem is now the same as finding PC_1, with one less variable. Hence the solution to the second component is a_{22} = ±1, a_{21} = a_{23} = \cdots = a_{2p} = 0, and the second PC is a_2 = (0, 1, 0, \ldots, 0)^T. Similarly, we can see that the j-th PC is just the unit vector in R^p whose j-th component is 1 and all other components are zero.
Conclusion: when the X_j's are uncorrelated, PCA basically keeps the X_j's with the largest variances.
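A quick numerical check of this conclusion (anticipating the eigendecomposition used in the next subsection): for a diagonal covariance matrix, the eigenvectors of Σ are just the coordinate axes, ordered by variance.

```python
import numpy as np

# Diagonal covariance matrix with decreasing variances, as in the derivation above
Sigma = np.diag([5.0, 3.0, 1.0])

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
print(eigvals[order])                  # [5. 3. 1.]
print(eigvecs[:, order])               # columns are (up to sign) e_1, e_2, e_3
```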
3.2 Solve PCA when Σ is not Diagonal
What if Σ is not diagonal? Luckily, we have the following theorem:
Theorem 1 (Real Spectral Theorem). If Σ is symmetric, then there exists a p×p matrix
P such that
Σ = P DP −1 ,
where D is a diagonal matrix, and
P T P = Ip×p ,
i.e., P −1 = P T .
Example 3.
$$\begin{bmatrix} 34 & 12 \\ 12 & 41 \end{bmatrix} = \begin{bmatrix} \frac{3}{5} & -\frac{4}{5} \\ \frac{4}{5} & \frac{3}{5} \end{bmatrix} \begin{bmatrix} 50 & 0 \\ 0 & 25 \end{bmatrix} \begin{bmatrix} \frac{3}{5} & \frac{4}{5} \\ -\frac{4}{5} & \frac{3}{5} \end{bmatrix}.$$
Since Σ is symmetric, it can now be written as Σ = P D P^T, where the notation follows the Real Spectral Theorem. Let b_i = P^T a_i. Then
$$a_i^T \Sigma a_i = a_i^T P D P^T a_i = (P^T a_i)^T D (P^T a_i) = b_i^T D b_i,$$
$$a_i^T a_i = a_i^T (P P^{-1}) a_i = a_i^T (P P^T) a_i = (P^T a_i)^T (P^T a_i) = b_i^T b_i.$$
Restating the maximization problem in terms of b_i gives
$$\max_{b_i \in \mathbb{R}^p} \; b_i' D b_i \quad \text{s.t. } b_i' b_i = 1,$$
and we know how to solve it since D is diagonal.
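Example 3 and this change of variables can be verified numerically; a minimal NumPy sketch:

```python
import numpy as np

Sigma = np.array([[34.0, 12.0],
                  [12.0, 41.0]])

# np.linalg.eigh returns the spectral decomposition of a symmetric matrix
eigvals, P = np.linalg.eigh(Sigma)
D = np.diag(eigvals)
print(eigvals)                           # [25. 50.]
print(np.allclose(P @ D @ P.T, Sigma))   # True: Sigma = P D P^T
print(P[:, np.argmax(eigvals)])          # first PC, (3/5, 4/5)^T up to sign
```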
4 Applications of PCA
Here we list examples of applications of PCA; a short code sketch follows the list.
1. Data compression
2. Summarizing data and exploratory analysis
3. Using PCs in regression
4. Using PCs in clustering
5. Factor analysis
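In practice one rarely solves the eigenproblem by hand. The following minimal sketch uses scikit-learn (an outside library, used here only for illustration) on simulated data to touch on items 1–4:

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated dataset: p = 10 variables driven by two underlying factors
rng = np.random.default_rng(3)
factors = rng.standard_normal((300, 2))
X = factors @ rng.standard_normal((2, 10)) + 0.1 * rng.standard_normal((300, 10))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # n x 2 compressed data (compression / summarization)
print(pca.explained_variance_ratio_)     # share of total variance captured by each PC

# The columns of `scores` can then be used as regressors or as inputs to clustering
```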
5 Choosing q
One might ask how much variance is summarized by the first q principal components, and how to choose q. A scree plot can help answer this question.
[Scree plot: eigenvalues (y-axis, roughly 300 down to near 0) plotted against component number (x-axis) in decreasing order.]
The y-axis of a scree plot shows the eigenvalues of the covariance matrix, sorted in decreasing order, and the x-axis shows the corresponding rank of each eigenvalue. As a rule of thumb, q is chosen at the “elbow” of the scree plot, so one might choose q = 2 for the example above.
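A minimal sketch of how such a scree plot can be drawn with NumPy and matplotlib, using simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data; in practice, replace X with your own n x p data matrix
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 6)) @ np.diag([10.0, 8.0, 1.0, 0.8, 0.5, 0.3])

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]   # eigenvalues, decreasing

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, marker="o")
plt.xlabel("Component Number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```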


Suppose that the covariance matrix
$$\Sigma = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_p \end{bmatrix}_{p \times p}$$
is a diagonal matrix. Then we have
$$\mathrm{Var}(PC_1) = \mathrm{Var}(X_1) = \lambda_1, \quad \mathrm{Var}(PC_2) = \mathrm{Var}(X_2) = \lambda_2, \quad \ldots, \quad \mathrm{Var}(PC_p) = \mathrm{Var}(X_p) = \lambda_p.$$