Principal Component Analysis
Jing Gao
SUNY Buffalo
Why Dimensionality Reduction?
• We have too many dimensions
  – To reason about or obtain insights from
  – To visualize
  – Too much noise in the data
  – Need to “reduce” them to a smaller set of factors
  – Better representation of data without losing much information
  – Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition
Component Analysis
• Discover a new set of factors/dimensions/axes
against which to represent, describe or evaluate
the data
• Factors are combinations of observed variables
– May be more effective bases for insights
– Observed data are described in terms of these factors
rather than in terms of original variables/dimensions
Basic Concept
• Directions of high variance in the data are where items can best be discriminated and where key underlying phenomena can be observed
  – These are the areas of greatest “signal” in the data
• If two items or dimensions are highly correlated or
dependent
– They are likely to represent highly related phenomena
– If they tell us about the same underlying variance in the
data, combining them to form a single measure is
reasonable
Basic Concept
• So we want to combine related variables, and focus
on uncorrelated or independent ones, especially
those along which the observations have high
variance
• We want a smaller set of variables that explain most
of the variance in the original data, in more compact
and insightful form
• These variables are called “factors” or “principal
components”
Principal Component Analysis
• Most common form of factor analysis
• The new variables/dimensions
– Are linear combinations of the original ones
– Are uncorrelated with one another
• Orthogonal in dimension space
– Capture as much of the original variance in the
data as possible
– Are called Principal Components
What are the new axes?
[Figure: scatter of data plotted on Original Variable A vs. Original Variable B, with the new axes PC 1 and PC 2 overlaid]
• Orthogonal directions of greatest variance in the data
• Projections along PC 1 discriminate the data most along any one axis
Principal Components
• The first principal component is the direction of greatest variability (variance) in the data
• Second is the next orthogonal (uncorrelated)
direction of greatest variability
– So first remove all the variability along the first
component, and then find the next direction of greatest
variability
• And so on …
Principal Components Analysis (PCA)
• Principle
– Linear projection method to reduce the number of
parameters
– Transform a set of correlated variables into a new set of uncorrelated variables
– Map the data into a space of lower dimensionality
• Properties
– It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
– New axes are orthogonal and represent the directions with
maximum variability
Algebraic definition of PCs
Given a sample of n observations on a vector of p variables

    x_1, x_2, \dots, x_n \in \mathbb{R}^p,

define the first principal component of the sample by the linear transformation

    z_1 = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \dots, n,

where

    a_1 = (a_{11}, a_{21}, \dots, a_{p1})^T, \qquad x_j = (x_{1j}, x_{2j}, \dots, x_{pj})^T,

and a_1 is chosen such that var[z_1] is maximized.
Algebraic derivation of PCs
To find a_1, first note that

    var[z_1] = E[(z_1 - \bar{z}_1)^2]
             = \frac{1}{n} \sum_{i=1}^{n} (a_1^T x_i - a_1^T \bar{x})^2
             = \frac{1}{n} \sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1
             = a_1^T S a_1,

where

    S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T

is the covariance matrix and

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

is the mean. In the following, we assume the data is centered: \bar{x} = 0.
Algebraic derivation of PCs
Assume \bar{x} = 0. Form the matrix

    X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{p \times n},

then

    S = \frac{1}{n} X X^T.
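As an illustration of this step, here is a minimal NumPy sketch (variable names and data sizes are my own, not from the slides) that centers a p × n data matrix and forms S = (1/n) X Xᵀ, cross-checked against NumPy's own covariance routine:

import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 100                      # p variables, n observations (arbitrary sizes)
X = rng.normal(size=(p, n))        # data matrix, one observation per column

X_centered = X - X.mean(axis=1, keepdims=True)   # subtract the mean of each variable
S = (X_centered @ X_centered.T) / n              # S = (1/n) X X^T, as in the slide

# np.cov with bias=True uses the same 1/n normalization
assert np.allclose(S, np.cov(X, bias=True))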
Algebraic derivation of PCs
To find a_1 that maximizes var[z_1] subject to a_1^T a_1 = 1, let λ be a Lagrange multiplier:

    L = a_1^T S a_1 - \lambda (a_1^T a_1 - 1)

    \frac{\partial L}{\partial a_1} = S a_1 - \lambda a_1 = 0
    \;\Rightarrow\; S a_1 = \lambda a_1
    \;\Rightarrow\; a_1^T S a_1 = \lambda.

Therefore a_1 is an eigenvector of S corresponding to the largest eigenvalue, λ = λ_1.
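A quick numerical check of this result (illustrative only; the data below are placeholders): among unit vectors, the eigenvector of S with the largest eigenvalue attains the largest value of aᵀSa.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
Xc = X - X.mean(axis=1, keepdims=True)
S = (Xc @ Xc.T) / Xc.shape[1]

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
a1 = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# a1^T S a1 equals the largest eigenvalue ...
assert np.isclose(a1 @ S @ a1, eigvals[-1])

# ... and no random unit vector achieves a larger value of a^T S a
for _ in range(1000):
    a = rng.normal(size=3)
    a /= np.linalg.norm(a)
    assert a @ S @ a <= eigvals[-1] + 1e-12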
Algebraic derivation of PCs
To find the next coefficient vector a_2 maximizing var[z_2], subject to z_2 being uncorrelated with z_1,

    cov[z_2, z_1] = 0,

and to a_2^T a_2 = 1. Note that

    cov[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,

so the constraint is equivalent to a_1^T a_2 = 0. Then let λ and φ be Lagrange multipliers, and maximize

    L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi \, a_2^T a_1.
Algebraic derivation of PCs
We find that a_2 is also an eigenvector of S, whose eigenvalue λ = λ_2 is the second largest. In general,

    var[z_k] = a_k^T S a_k = \lambda_k.

• The kth largest eigenvalue of S is the variance of the kth PC.
• The kth PC z_k retains the kth greatest fraction of the variation in the sample.
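To make the “fraction of the variation” statement concrete, here is a small sketch (my own illustration) that turns the eigenvalues of S into the proportion of total variance each PC retains; the two numbers are borrowed from the worked example later in the slides:

import numpy as np

eigvals = np.array([560.2, 51.8])          # eigenvalues, e.g. from the example later in the slides
eigvals = np.sort(eigvals)[::-1]           # sort in decreasing order

explained = eigvals / eigvals.sum()        # fraction of total variance per PC
print(explained)                           # roughly [0.915, 0.085]: PC1 alone keeps about 92% of the variance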
Algebraic derivation of PCs
• Main steps for computing PCs
  – Form the covariance matrix S.
  – Compute its eigenvectors {a_i}_{i=1}^{p}.
  – Use the first d eigenvectors {a_i}_{i=1}^{d} to form the d PCs.
  – The transformation G is given by

        G = [a_1, a_2, \dots, a_d].

    A test point x \in \mathbb{R}^p is mapped to G^T x \in \mathbb{R}^d (see the sketch below).
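A sketch of these steps on synthetic data (illustrative code, not from the slides): the eigenvectors with the d largest eigenvalues form G, and a centered test point is mapped to Gᵀx.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 300))                       # p = 5 variables, n = 300 observations
Xc = X - X.mean(axis=1, keepdims=True)
S = (Xc @ Xc.T) / Xc.shape[1]                       # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)                # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]                   # indices of eigenvalues, largest first
d = 2
G = eigvecs[:, order[:d]]                           # G = [a_1, ..., a_d], shape (p, d)

x_test = rng.normal(size=5)
x_test_centered = x_test - X.mean(axis=1)
z = G.T @ x_test_centered                           # d-dimensional representation of the test point
print(z.shape)                                      # (2,)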
Dimensionality Reduction
Original data X \in \mathbb{R}^{p}  →  reduced data Y \in \mathbb{R}^{d}

Linear transformation G \in \mathbb{R}^{p \times d} (so G^T \in \mathbb{R}^{d \times p}):

    G^T : X \in \mathbb{R}^{p} \;\mapsto\; Y = G^T X \in \mathbb{R}^{d}
Steps of PCA
• Let \bar{X} be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X' = X - \bar{X}
• Compute the covariance matrix S of the adjusted data X'
• Find the eigenvectors and eigenvalues of S (sketched below)
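A minimal end-to-end sketch of these steps in NumPy (the function and variable names are my own, not from the slides):

import numpy as np

def pca(X, d):
    """PCA following the steps above.
    X: (p, n) data matrix, one observation per column.
    d: number of principal components to keep.
    Returns (mean, components, eigenvalues), components of shape (p, d)."""
    mean = X.mean(axis=1, keepdims=True)        # step 1: mean vector
    X_adj = X - mean                            # step 2: adjust the data by the mean
    S = (X_adj @ X_adj.T) / X.shape[1]          # step 3: covariance matrix (1/n convention)
    eigvals, eigvecs = np.linalg.eigh(S)        # step 4: eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
    return mean, eigvecs[:, order[:d]], eigvals[order]

# usage on random data
rng = np.random.default_rng(3)
X = rng.normal(size=(4, 50))
mean, G, eigvals = pca(X, d=2)
Y = G.T @ (X - mean)                            # reduced (d, n) representation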
Principal components - Variance

[Bar chart: percentage of variance (%) explained by each of PC1 through PC10; y-axis from 0 to 25%]
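A chart of this kind can be produced with a few lines of matplotlib; the eigenvalues below are placeholders, not the values behind the original figure:

import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([24., 19., 14., 11., 9., 7., 6., 4., 3., 3.])   # placeholder eigenvalues
variance_pct = 100 * eigvals / eigvals.sum()                        # percent of variance per PC

labels = [f"PC{k}" for k in range(1, len(eigvals) + 1)]
plt.bar(labels, variance_pct)
plt.ylabel("Variance (%)")
plt.show()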
Transformed Data
• Eigenvalue λ_j corresponds to the variance along component j
• Thus, sort the eigenvalues λ_j in decreasing order
• Take the first d eigenvectors a_j, where d is the number of top eigenvalues kept
• These are the directions with the largest variances

    \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{id} \end{pmatrix}
    =
    \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_d^T \end{pmatrix}
    \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{ip} - \bar{x}_p \end{pmatrix}
An Example

    X1    X2     X1'      X2'
    19    63    -5.1     9.25
    39    74    14.9    20.25
    30    87     5.9    33.25
    30    23     5.9   -30.75
    15    35    -9.1   -18.75
    15    43    -9.1   -10.75
    15    32    -9.1   -21.75
    30    73     5.9    19.25

Mean1 = 24.1, Mean2 = 53.8; X1' = X1 - Mean1, X2' = X2 - Mean2

[Scatter plots: the original data (X1 vs. X2) and the mean-adjusted data (X1' vs. X2')]
Covariance Matrix
• C =
      75   106
     106   482

• We find:
  – Eigenvectors and eigenvalues:
  – a1 = (0.21, -0.98),  λ1 = 560.2
  – a2 = (-0.98, -0.21), λ2 = 51.8
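The decomposition can be reproduced numerically from the example data; a short sketch follows (note that, because of the 1/n normalization and the rounding above, the eigenvalues and eigenvectors NumPy returns need not coincide exactly with the rounded figures on this slide):

import numpy as np

X1 = np.array([19, 39, 30, 30, 15, 15, 15, 30], dtype=float)
X2 = np.array([63, 74, 87, 23, 35, 43, 32, 73], dtype=float)
X = np.vstack([X1, X2])

Xc = X - X.mean(axis=1, keepdims=True)          # mean-adjusted data
C = (Xc @ Xc.T) / X.shape[1]                    # covariance with the 1/n convention
print(np.round(C))                              # roughly [[75, 106], [106, 482]]

eigvals, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
print(eigvals[::-1])                            # eigenvalues, largest first
print(eigvecs[:, ::-1])                         # corresponding eigenvectors as columns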
Transform to One-dimension

• We keep only the direction a1 = (0.21, -0.98)
• We obtain the final data as

    y_i = (0.21 \;\; -0.98) \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21 \, x_{i1} - 0.98 \, x_{i2},

applied to the mean-adjusted values (X1', X2'), giving

    y_i:  -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63

[Plot: the eight projected values y_i shown along a single axis]
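These projections follow directly from the formula applied to the mean-adjusted values; a short check (variable names are mine):

import numpy as np

# mean-adjusted example data (X1', X2') from the earlier slide
X1p = np.array([-5.1, 14.9, 5.9, 5.9, -9.1, -9.1, -9.1, 5.9])
X2p = np.array([9.25, 20.25, 33.25, -30.75, -18.75, -10.75, -21.75, 19.25])

y = 0.21 * X1p - 0.98 * X2p          # project every point onto a1 = (0.21, -0.98)
print(np.round(y, 3))
# approximately: [-10.136 -16.716 -31.346  31.374  16.464   8.624  19.404 -17.626]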