Principal Component Analysis

Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis (PCA)
• Nonlinear PCA using Kernels
What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
  – The criterion for feature reduction differs across problem settings.
    • Unsupervised setting: minimize the information loss.
    • Supervised setting: maximize the class discrimination.

• Given a set of n data points of p variables each,
  $x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$
  compute the linear transformation (projection)
  $G \in \mathbb{R}^{p \times d}: \; x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d \quad (d \ll p).$

What is feature reduction?
Diagram: the original data $X \in \mathbb{R}^p$ is mapped to the reduced data $Y = G^T X \in \mathbb{R}^d$ by the linear transformation $G^T \in \mathbb{R}^{d \times p}$, where $G \in \mathbb{R}^{p \times d}$.
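As a quick illustration of this mapping (my own sketch, not from the slides, using random data and an arbitrary orthonormal G), the projection amounts to a single matrix product in NumPy:

```python
# Minimal sketch of y = G^T x: synthetic data and an arbitrary orthonormal G (both hypothetical).
import numpy as np

p, d, n = 100, 2, 50                         # original dim, reduced dim, number of points
X = np.random.randn(p, n)                    # data matrix, one observation per column
G = np.linalg.qr(np.random.randn(p, d))[0]   # some p x d matrix with orthonormal columns

Y = G.T @ X                                  # reduced data
print(X.shape, "->", Y.shape)                # (100, 50) -> (2, 50)
```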
High-dimensional data
• Gene expression
• Face images
• Handwritten digits
Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data.
  – Curse of dimensionality
  – Query accuracy and efficiency degrade rapidly as the dimension increases.
• The intrinsic dimension may be small.
  – For example, the number of genes responsible for a certain type of disease may be small.
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
Application of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Feature reduction algorithms
• Unsupervised
  – Latent Semantic Indexing (LSI): truncated SVD
  – Independent Component Analysis (ICA)
  – Principal Component Analysis (PCA)
  – Canonical Correlation Analysis (CCA)
• Supervised
  – Linear Discriminant Analysis (LDA)
• Semi-supervised
  – Research topic
What is Principal Component Analysis?
• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
  – Retains most of the sample's information
  – Useful for the compression and classification of data
• By information we mean the variation present in the sample, given by the correlations between the original variables.
  – The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.
Geometric picture of principal components (PCs)
• The 1st PC $z_1$ is a minimum-distance fit to a line in the X space.
• The 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
Algebraic definition of PCs
Given a sample of n observations on a vector of p variables,
$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$
define the first principal component of the sample by the linear transformation
$z_{1j} = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, n,$
where $x_j = (x_{1j}, x_{2j}, \ldots, x_{pj})$ and the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})$ is chosen such that $\mathrm{var}[z_1]$ is maximized.
Algebraic derivation of PCs
To find $a_1$, first note that
$\mathrm{var}[z_1] = E\big[(z_1 - \bar{z}_1)^2\big] = \frac{1}{n}\sum_{i=1}^{n}\big(a_1^T x_i - a_1^T \bar{x}\big)^2 = \frac{1}{n}\sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1 = a_1^T S a_1,$
where
$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$
is the covariance matrix and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.
In the following, we assume the data is centered, i.e., $\bar{x} = 0$.
Algebraic derivation of PCs
Assume $\bar{x} = 0$ and form the matrix
$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n};$
then
$S = \frac{1}{n} X X^T.$
Obtain the eigenvectors of S by computing the SVD of X:
$X = U \Sigma V^T.$
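A small sketch (synthetic, centered data; my own illustration) confirming that the left singular vectors of X are eigenvectors of $S = \frac{1}{n} X X^T$, with eigenvalues $\sigma_i^2 / n$:

```python
# Sketch: eigen-decomposition of S via the SVD of the (centered) data matrix X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))
X = X - X.mean(axis=1, keepdims=True)        # centered data, as assumed on the slide
n = X.shape[1]
S = (X @ X.T) / n

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = sigma**2 / n                  # eigenvalues of S from the singular values of X

w, V = np.linalg.eigh(S)                     # direct eigendecomposition (ascending order)
print(np.allclose(np.sort(eig_from_svd), w)) # True: both routes give the same spectrum
```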
Algebraic derivation of PCs
To find the $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let $\lambda$ be a Lagrange multiplier and set
$L = a_1^T S a_1 - \lambda (a_1^T a_1 - 1).$
Then
$\frac{\partial L}{\partial a_1} = 2 S a_1 - 2\lambda a_1 = 0 \;\Rightarrow\; (S - \lambda I_p) a_1 = 0.$
Therefore $a_1$ is an eigenvector of S, corresponding to the largest eigenvalue $\lambda = \lambda_1$.
Algebraic derivation of PCs
To find the next coefficient vector $a_2$, maximize $\mathrm{var}[z_2]$ subject to $a_2^T a_2 = 1$ and to $z_2$ being uncorrelated with $z_1$, i.e., $\mathrm{cov}[z_2, z_1] = 0$.
First note that
$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,$
then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize
$L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1.$
Algebraic derivation of PCs
$L = a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1$
$\frac{\partial L}{\partial a_2} = 2 S a_2 - 2\lambda a_2 - \phi a_1 = 0.$
Left-multiplying by $a_1^T$ and using the constraints gives $\phi = 0$, so
$S a_2 = \lambda a_2 \quad \text{and} \quad \lambda = a_2^T S a_2.$
Algebraic derivation of PCs
We find that $a_2$ is also an eigenvector of S, whose eigenvalue $\lambda = \lambda_2$ is the second largest.
In general,
$\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k.$
• The kth largest eigenvalue of S is the variance of the kth PC.
• The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.
Algebraic derivation of PCs
• Main steps for computing PCs
  – Form the covariance matrix S.
  – Compute its eigenvectors $\{a_i\}_{i=1}^{p}$.
  – Use the first d eigenvectors $\{a_i\}_{i=1}^{d}$ to form the d PCs.
  – The transformation G is given by $G = [a_1, a_2, \ldots, a_d]$; a test point $x \in \mathbb{R}^p$ is mapped to $G^T x \in \mathbb{R}^d$ (see the sketch below).
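A minimal NumPy sketch of these steps (my own illustration; the helper name `pca` and the data layout, one observation per column, are assumptions rather than the slides' notation):

```python
# Sketch of the main PCA steps: covariance, eigenvectors, first d PCs, projection G^T x.
import numpy as np

def pca(X, d):
    """Return the p x d transformation G and the reduced data G^T X (X: one column per point)."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    S = (Xc @ Xc.T) / X.shape[1]             # covariance matrix S
    w, V = np.linalg.eigh(S)                 # eigenvalues ascending, eigenvectors in columns
    G = V[:, ::-1][:, :d]                    # first d eigenvectors (largest eigenvalues)
    return G, G.T @ Xc

# Usage on random data: reduce 50-dimensional points to d = 3.
X = np.random.randn(50, 200)
G, Y = pca(X, d=3)
print(G.shape, Y.shape)                      # (50, 3) (3, 200)
```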
Optimality property of PCA
Dimension reduction: the original data $X \in \mathbb{R}^{p \times n}$ is mapped to $Y = G^T X \in \mathbb{R}^{d \times n}$ by $G^T \in \mathbb{R}^{d \times p}$.
Reconstruction: $Y = G^T X \in \mathbb{R}^{d \times n}$ is mapped back to $\tilde{X} = G(G^T X) \in \mathbb{R}^{p \times n}$ by $G \in \mathbb{R}^{p \times d}$.
Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:
$\min_{G \in \mathbb{R}^{p \times d}} \; \| X - G (G^T X) \|_F^2 \quad \text{subject to} \quad G^T G = I_d,$
where $\| X - G (G^T X) \|_F^2$ is the reconstruction error.
PCA projection minimizes the reconstruction error among all linear projections of size d.
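A sketch of this optimality claim on synthetic data (my own illustration): the PCA basis never gives a larger reconstruction error than a random orthonormal basis of the same size d.

```python
# Sketch: compare ||X - G G^T X||_F^2 for the PCA basis vs. a random orthonormal basis.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 500))
Xc = X - X.mean(axis=1, keepdims=True)                 # centered data
d = 5

S = (Xc @ Xc.T) / Xc.shape[1]
G_pca = np.linalg.eigh(S)[1][:, ::-1][:, :d]           # first d eigenvectors of S
G_rand = np.linalg.qr(rng.normal(size=(30, d)))[0]     # random orthonormal p x d basis

def recon_error(G):
    return np.linalg.norm(Xc - G @ (G.T @ Xc), 'fro')**2

print(recon_error(G_pca) <= recon_error(G_rand))       # True
```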
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
PCA for image compression
Figure: reconstructions of an example image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, compared with the original image.
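The idea behind these reconstructions, in a rough sketch (not the lecture's exact experiment; the image here is a random placeholder array): treat each image row as an observation, project onto the top d PCs of the pixel columns, and reconstruct.

```python
# Sketch of PCA image compression: keep only d principal components of the row vectors.
import numpy as np

img = np.random.rand(256, 256)                         # placeholder for a grayscale image
mean = img.mean(axis=0, keepdims=True)
rows = img - mean                                      # each row is one centered observation
S = (rows.T @ rows) / rows.shape[0]                    # covariance over the 256 pixel columns

for d in (1, 2, 4, 8, 16, 32, 64, 100):
    G = np.linalg.eigh(S)[1][:, ::-1][:, :d]           # top-d eigenvectors
    recon = rows @ G @ G.T + mean                      # reconstruction from d components
    err = np.linalg.norm(img - recon) / np.linalg.norm(img)
    print(f"d={d:3d}  relative error={err:.3f}")
```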
Motivation
Linear projections will not detect the pattern.
Nonlinear PCA using Kernels
• Traditional PCA applies a linear transformation.
  – May not be effective for nonlinear data.
• Solution: apply a nonlinear transformation to a potentially very high-dimensional feature space:
  $\phi: x \mapsto \phi(x)$
• Computational efficiency: apply the kernel trick.
  – Requires that PCA can be rewritten in terms of dot products:
  $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$
(More on kernels later.)
Nonlinear PCA using Kernels
Rewrite PCA in terms of dot products.
Assume the data has been centered, i.e., $\sum_i x_i = 0$.
The covariance matrix S can then be written as $S = \frac{1}{n}\sum_i x_i x_i^T$.
Let v be an eigenvector of S corresponding to a nonzero eigenvalue $\lambda$:
$Sv = \frac{1}{n}\sum_i x_i x_i^T v = \lambda v \;\Rightarrow\; v = \frac{1}{n\lambda}\sum_i (x_i^T v)\, x_i.$
Eigenvectors of S lie in the space spanned by all data points.
Nonlinear PCA using Kernels
$Sv = \frac{1}{n}\sum_i x_i x_i^T v = \lambda v \;\Rightarrow\; v = \frac{1}{n\lambda}\sum_i (x_i^T v)\, x_i$
The covariance matrix can be written in matrix form:
$S = \frac{1}{n} X X^T, \quad \text{where } X = [x_1, x_2, \ldots, x_n].$
Write $v = \sum_i \alpha_i x_i = X\alpha$. Then
$Sv = \frac{1}{n} X X^T X \alpha = \lambda X \alpha$
$\Rightarrow \frac{1}{n} (X^T X)(X^T X)\alpha = \lambda (X^T X)\alpha$
$\Rightarrow \frac{1}{n} (X^T X)\alpha = \lambda \alpha.$
Any benefits?
Nonlinear PCA using Kernels
Next consider the feature space $\phi: x \mapsto \phi(x)$:
$S^{\phi} = \frac{1}{n} X^{\phi} (X^{\phi})^T, \quad \text{where } X^{\phi} = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)],$
$v = \sum_i \alpha_i \phi(x_i) = X^{\phi}\alpha \;\Rightarrow\; \frac{1}{n}\big((X^{\phi})^T X^{\phi}\big)\alpha = \lambda \alpha.$
The (i,j)-th entry of $(X^{\phi})^T X^{\phi}$ is $\phi(x_i) \cdot \phi(x_j)$.
Apply the kernel trick: $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$; K is called the kernel matrix. Then
$\frac{1}{n} K \alpha = \lambda \alpha.$
Nonlinear PCA using Kernels
• Projection of a test point x onto v:
$\phi(x) \cdot v = \phi(x) \cdot \sum_i \alpha_i \phi(x_i) = \sum_i \alpha_i\, \phi(x) \cdot \phi(x_i) = \sum_i \alpha_i K(x, x_i).$
The explicit mapping $\phi$ is not required here.
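A compact kernel PCA sketch (my own illustration: the slides leave K generic, here an RBF kernel with an assumed bandwidth gamma is used, the kernel matrix is centered in feature space with the standard formula, and centering of the test-point kernel is omitted for brevity):

```python
# Sketch of kernel PCA: solve (1/n) K alpha = lambda alpha and project a test point.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel between the rows of A and B (gamma is an assumed hyperparameter)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # n = 100 points in 2D (rows are points here)

K = rbf_kernel(X, X)                          # kernel matrix
n = K.shape[0]
one = np.ones((n, n)) / n
Kc = K - one @ K - K @ one + one @ K @ one    # center K in feature space

w, A = np.linalg.eigh(Kc / n)                 # eigenpairs of (1/n) K, ascending order
alpha = A[:, ::-1][:, :2]                     # coefficients of the top-2 components
lam = w[::-1][:2]
alpha = alpha / np.sqrt(n * lam)              # normalize so v = sum_i alpha_i phi(x_i) has unit norm

x_test = rng.normal(size=(1, 2))
k_test = rbf_kernel(x_test, X)                # K(x, x_i) for the test point, shape (1, n)
projection = k_test @ alpha                   # sum_i alpha_i K(x, x_i): top-2 kernel PC scores
print(projection.shape)                       # (1, 2)
```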
Reference
• Principal Component Analysis. I.T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf et al.
• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.