Introduction to Kernel Principal Component Analysis (PCA)

Mohammed Nasser
Dept. of Statistics, RU,Bangladesh
Email: mnasser.ru@gmail.com
Contents
Basics of PCA
Application of PCA in Face Recognition
Some Terms in PCA
Motivation for KPCA
Basics of KPCA
Applications of KPCA
High-dimensional Data
Gene expression
Face images
Handwritten digits
Why Feature Reduction?
• Most machine learning and data mining techniques may
not be effective for high-dimensional data
– Curse of Dimensionality
– Query accuracy and efficiency degrade rapidly as the
dimension increases.
• The intrinsic dimension may be small.
– For example, the number of genes responsible for a
certain type of disease may be small.
Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Feature reduction algorithms
• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)
• Supervised
– Linear Discriminant Analysis (LDA)
• Semi-supervised
– Research topic
Algebraic derivation of PCs
• Main steps for computing PCs
– Form the covariance matrix S.
– Compute its eigenvectors: $\{u_i\}_{i=1}^{p}$
– Use the first d eigenvectors $\{u_i\}_{i=1}^{d}$ to form the d PCs.
– The transformation G is given by $G = [u_1, u_2, \ldots, u_d]$
A test point $x \in \mathbb{R}^{p} \to G^T x \in \mathbb{R}^{d}$.
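The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the author's implementation: the function name `pca_fit`, the synthetic data, and the choice to subtract the sample mean before projecting are assumptions made for the example.

```python
import numpy as np

def pca_fit(X, d):
    """X is p x n (one column per observation). Returns mean, G (p x d), eigenvalues."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                              # center the data (assumption)
    S = Xc @ Xc.T / X.shape[1]                 # covariance matrix S (p x p)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    G = eigvecs[:, order[:d]]                  # G = [u_1, ..., u_d]
    return mean, G, eigvals[order]

# Hypothetical data: 5 variables with decreasing spread, 200 observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200)) * np.array([[3.0], [2.0], [1.0], [0.5], [0.1]])

mean, G, eigvals = pca_fit(X, d=2)
Y = G.T @ (X - mean)                           # projected data, d x n
```

Each column of `Y` is the image $G^T x$ of one (centered) observation.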
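Optimality property of PCA
Dimension reduction: $X \in \mathbb{R}^{p \times n} \to Y = G^T X \in \mathbb{R}^{d \times n}$, where $G \in \mathbb{R}^{p \times d}$ and $G^T \in \mathbb{R}^{d \times p}$
Reconstruction: $Y = G^T X \in \mathbb{R}^{d \times n} \to \bar{X} = G (G^T X) \in \mathbb{R}^{p \times n}$
Original data: $X \in \mathbb{R}^{p \times n}$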
Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:
$\min_{G \in \mathbb{R}^{p \times d}} \| X - G (G^T X) \|_F^2$ subject to $G^T G = I_d$
where $\| X - \bar{X} \|_F^2$, with $\bar{X} = G (G^T X)$, is the reconstruction error.
PCA projection minimizes the reconstruction error among all linear projections of size d.
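A small numerical check of this optimality property, as a sketch on synthetic centered data. The random orthonormal matrix `G_rand` is just one arbitrary competitor chosen for the illustration; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, d = 6, 300, 2

# Hypothetical centered data, p x n, with unequal variances per variable.
X = rng.normal(size=(p, n)) * np.linspace(3.0, 0.2, p)[:, None]
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T / n
eigvals, eigvecs = np.linalg.eigh(S)           # ascending eigenvalues
G_pca = eigvecs[:, ::-1][:, :d]                # first d eigenvectors

def reconstruction_error(G, X):
    """Squared Frobenius norm of X - G (G^T X)."""
    return np.linalg.norm(X - G @ (G.T @ X), "fro") ** 2

# A random matrix with orthonormal columns (G^T G = I_d) for comparison.
G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))

err_pca = reconstruction_error(G_pca, X)
err_rand = reconstruction_error(G_rand, X)
```

For centered data, `err_pca` equals n times the sum of the discarded eigenvalues, which is why keeping the largest eigenvalues minimizes the error.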
Dimensionality Reduction
• One approach to deal with high dimensional data is by
reducing their dimensionality.
• Project high dimensional data onto a lower
dimensional sub-space using linear or non-linear
transformations.
Dimensionality Reduction
• Linear transformations are simple to compute and
tractable.
$Y = U^T X$ (i.e., $b_i = u_i^T a_i$), where Y is k × 1, $U^T$ is k × d, and X is d × 1 (k << d)
• Classical (linear) approaches:
– Principal Component Analysis (PCA)
– Fisher Discriminant Analysis (FDA)
– Singular Value Decomposition (SVD)
– Factor Analysis (FA)
– Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
• Each dimensionality reduction technique finds an
appropriate transformation by satisfying certain criteria
(e.g., information loss, data discrimination, etc.)
• The goal of PCA is to reduce the dimensionality of the data
while retaining as much as possible of the variation
present in the dataset.
Principal Component Analysis (PCA)
• Find a basis in a low dimensional sub-space:
– Approximate vectors by projecting them in a low
dimensional sub-space:
(1) Original space representation:
$x = a_1 v_1 + a_2 v_2 + \cdots + a_N v_N$
where $v_1, v_2, \ldots, v_N$ is a basis of the original N-dimensional space
(2) Lower-dimensional sub-space representation:
$\hat{x} = b_1 u_1 + b_2 u_2 + \cdots + b_K u_K$
where $u_1, u_2, \ldots, u_K$ is a basis of the K-dimensional sub-space (K < N)
• Note: if K = N, then $\hat{x} = x$
Principal Component Analysis (PCA)
• Example (K=N):
Principal Component Analysis (PCA)
• Methodology
– Suppose x1, x2, ..., xM are N x 1 vectors
Principal Component Analysis (PCA)
• Methodology – cont.
$b_i = u_i^T (x - \bar{x})$
Principal Component Analysis (PCA)
• Linear transformation implied by PCA
– The linear transformation $\mathbb{R}^N \to \mathbb{R}^K$ that performs the dimensionality
reduction is:
Principal Component Analysis (PCA)
• How many principal components to keep?
– To choose K, you can use the following criterion:
Unfortunately, for some data sets this requirement can only be met with K almost
equal to N. In that case, no effective data reduction is possible.
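The criterion itself is not reproduced in this text. One widely used rule, assumed here purely for illustration, keeps the smallest K whose eigenvalues account for at least a chosen fraction of the total variance. The function name `choose_k` and the 0.9 threshold are assumptions, not values from the slides.

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest K with (sum of K largest eigenvalues) / (sum of all) >= threshold."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# A rapidly decaying spectrum: a few components are enough.
k_fast = choose_k([5.0, 3.0, 1.5, 0.3, 0.2])

# A flat spectrum: K has to be (almost) equal to N, so no effective reduction.
k_flat = choose_k([1.0, 1.0, 1.0, 1.0, 1.0])
```

The flat spectrum is the unfavourable case described above: every component carries the same variance, so nothing can be discarded cheaply.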
Principal Component Analysis (PCA)
• Eigenvalue spectrum
[Figure: scree plot of the eigenvalues λ, down to λN, with the cutoff K marked]
Principal Component Analysis (PCA)
• Standardization
– The principal components are dependent on the units
used to measure the original variables as well as on the
range of values they assume.
– We should always standardize the data prior to using
PCA.
– A common standardization method is to transform all the
data to have zero mean and unit standard deviation:
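A minimal sketch of this standardization, assuming the data are stored with one observation per row and using the sample standard deviation (`ddof=1`). Both conventions, the function name `standardize`, and the example variables are assumptions. A column with zero variance would need separate handling.

```python
import numpy as np

def standardize(X):
    """X is M x N (rows are observations). Returns data with zero mean, unit SD per column."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)                # sample standard deviation
    return (X - mean) / std

# Hypothetical variables measured in very different units.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(170.0, 10.0, size=100),         # e.g. a length in cm
    rng.normal(70000.0, 5000.0, size=100),     # e.g. a mass in g
])
Z = standardize(X)
```

After this step both columns contribute on the same scale, so the principal components no longer depend on the measurement units.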
CS 479/679
Pattern Recognition – Spring 2006
Dimensionality Reduction Using PCA/LDA
Chapter 3 (Duda et al.) – Section 3.8
Case Studies:
Face Recognition Using Dimensionality Reduction
M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, 3(1), pp. 71-86, 1991.
D. Swets, J. Weng, "Using Discriminant Eigenfeatures for Image Retrieval", IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(8), pp. 831-836, 1996.
A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 23, no. 2, pp. 228-233, 2001.
Principal Component Analysis (PCA)
• Face Recognition
– The simplest approach is to think of it as a template
matching problem
– Problems arise when performing recognition in a
high-dimensional space.
– Significant improvements can be achieved by first
mapping the data into a lower dimensionality
space.
– How to find this lower-dimensional space?
Principal Component Analysis (PCA)
• Main idea behind eigenfaces
average face
Principal Component Analysis (PCA)
• Computation of the eigenfaces
Principal Component Analysis (PCA)
• Computation of the eigenfaces – cont.
Principal Component Analysis (PCA)
• Computation of the eigenfaces – cont.
Note that $u_i$ is normalized.
Principal Component Analysis (PCA)
• Computation of the eigenfaces – cont.
Principal Component Analysis (PCA)
• Representing faces onto this basis
Principal Component Analysis (PCA)
• Representing faces onto this basis – cont.
Principal Component Analysis (PCA)
• Face Recognition Using Eigenfaces
Principal Component Analysis (PCA)
• Face Recognition Using Eigenfaces – cont.
– The distance e_r is called the distance within the face space
(difs).
– Comment: we can use the common Euclidean distance
to compute e_r; however, it has been reported that the
Mahalanobis distance performs better:
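The formula is not reproduced in this text. As a hedged sketch, one common form of the Mahalanobis distance in face space weights each squared coefficient difference by the inverse of the corresponding eigenvalue, which assumes the projection coefficients are uncorrelated with variances λ_i. The function names and numbers below are hypothetical.

```python
import numpy as np

def euclidean_difs(b, b_k):
    """Euclidean distance between two coefficient vectors in face space."""
    return float(np.linalg.norm(b - b_k))

def mahalanobis_difs(b, b_k, eigvals):
    """Distance with each squared difference weighted by 1 / eigenvalue (assumed form)."""
    return float(np.sqrt(np.sum((b - b_k) ** 2 / eigvals)))

# Hypothetical coefficients on two eigenfaces with very different variances.
eigvals = np.array([4.0, 0.04])
b = np.array([2.0, 0.2])       # unknown face
b_k = np.array([0.0, 0.0])     # stored face k

d_euc = euclidean_difs(b, b_k)
d_mah = mahalanobis_difs(b, b_k, eigvals)
```

In this example the Euclidean distance is dominated by the high-variance first coefficient, while the weighted distance treats a deviation of one standard deviation along either eigenface equally.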
Principal Component Analysis (PCA)
• Face Detection Using Eigenfaces
Principal Component Analysis (PCA)
• Face Detection Using Eigenfaces – cont.
Principal Components Analysis
So, principal components are given by:
b1 = u11x1 + u12x2 + ... + u1NxN
b2 = u21x1 + u22x2 + ... + u2NxN
...
bN = uN1x1 + uN2x2 + ... + uNNxN
The xj's are standardized if the correlation matrix is used (mean
0.0, SD 1.0).
Score of the ith unit on the jth principal component:
bi,j = uj1xi1 + uj2xi2 + ... + ujNxiN
PCA Scores
[Figure: scatter plot with axes xi1 and xi2, marking the scores bi,1 and bi,2]
Principal Components Analysis
Amount of variance accounted for by:
1st principal component: λ1, the 1st eigenvalue
2nd principal component: λ2, the 2nd eigenvalue
...
λ1 > λ2 > λ3 > λ4 > ...
Average λj = 1 (correlation matrix)
Principal Components Analysis: Eigenvalues
[Figure: scatter plot annotated with λ1, λ2 and U1]
PCA: Terminology
• jth principal component is jth eigenvector of
correlation/covariance matrix
• coefficients, ujk, are elements of eigenvectors and
relate original variables (standardized if using
correlation matrix) to components
• scores are values of units on components (produced
using coefficients)
• amount of variance accounted for by component is
given by eigenvalue, λj
• proportion of variance accounted for by
component is given by λj / Σ λj
• loading of kth original variable on jth component is
given by ujk √λj, which is the correlation between the variable
and the component (when the correlation matrix is used)
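These terms can be illustrated with a short NumPy sketch based on the correlation matrix. The synthetic data and all variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical data: two related variables plus one independent variable.
rng = np.random.default_rng(2)
M = 500
z = rng.normal(size=M)
X = np.column_stack([
    z + 0.3 * rng.normal(size=M),
    2.0 * z + rng.normal(size=M),
    rng.normal(size=M),
])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
R = np.corrcoef(X, rowvar=False)                   # correlation matrix

eigvals, U = np.linalg.eigh(R)                     # ascending order
eigvals, U = eigvals[::-1], U[:, ::-1]             # descending; U[k, j] = u_jk

scores = Z @ U                                     # scores[i, j]: unit i on component j
proportion = eigvals / eigvals.sum()               # lambda_j / sum of lambdas
loadings = U * np.sqrt(eigvals)                    # loadings[k, j] = u_jk * sqrt(lambda_j)
```

The variance of each column of `scores` equals the corresponding eigenvalue, and each loading matches the sample correlation between a standardized variable and a component score.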
Principal Components Analysis
• Covariance Matrix:
– Variables must be in same units
– Emphasizes variables with most variance
– Mean eigenvalue ≠1.0
– Useful in morphometrics, a few other cases
• Correlation Matrix:
– Variables are standardized (mean 0.0, SD 1.0)
– Variables can be in different units
– All variables have same impact on analysis
– Mean eigenvalue = 1.0
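A brief sketch contrasting the two choices on hypothetical data with mismatched units. The variable names and distributions are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(3)
length_mm = rng.normal(1000.0, 50.0, size=400)   # large variance because of the unit
mass_kg = rng.normal(2.0, 0.1, size=400)         # small variance because of the unit
X = np.column_stack([length_mm, mass_kg])

# Eigenvalues in descending order for each matrix.
cov_eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
cor_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

cov_share = cov_eig[0] / cov_eig.sum()           # share of the first component
cor_share = cor_eig[0] / cor_eig.sum()
```

With the covariance matrix the first component is almost entirely the variable with the largest variance; with the correlation matrix both variables have the same impact and the mean eigenvalue is 1.0.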
PCA: Potential Problems
• Lack of Independence
– NO PROBLEM
• Lack of Normality
– Normality desirable but not essential
• Lack of Precision
– Precision desirable but not essential
• Many Zeroes in Data Matrix
– Problem (use Correspondence Analysis)
Principal Component Analysis (PCA)
• PCA and classification (cont’d)
Motivation
[Figure: scatter plot of v against z]
Motivation
[Figure: scatter plot of u against z, annotated "???????"]
Motivation
Linear projections will not detect the pattern.
Limitations of linear PCA
$\lambda_1 = \lambda_2 = \lambda_3 = 1/3$
Nonlinear PCA
Three popular methods are available:
1) Neural-network based PCA (E. Oja, 1982)
2) Method of Principal Curves (T.J. Hastie and W. Stuetzle, 1989)
3) Kernel based PCA (B. Schölkopf, A. Smola, and K. Müller, 1998)
[Figure: PCA compared with NPCA]
Kernel PCA: The main idea
A Useful Theorem for Hilbert space
Let H be a Hilbert space and x1, ..., xn in H. Let
M = span{x1, ..., xn}, and let u and v be in M.
If <xi, u> = <xi, v> for i = 1, ..., n, then u = v.
Proof.
Try it yourself.
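For readers who want to check their answer, here is one possible proof sketch. It is not taken from the slides, and the symbols w and c_i are introduced only for this argument.

```latex
Let $w = u - v$. Since $u, v \in M$, also $w \in M$, so
$w = \sum_{i=1}^{n} c_i x_i$ for some scalars $c_i$.
By assumption, $\langle x_i, w \rangle = \langle x_i, u \rangle - \langle x_i, v \rangle = 0$
for $i = 1, \ldots, n$. For a real Hilbert space,
\[
  \langle w, w \rangle = \sum_{i=1}^{n} c_i \langle x_i, w \rangle = 0 ,
\]
hence $w = 0$, that is, $u = v$.
(In the complex case the coefficients may appear conjugated, depending on the
convention, but every term of the sum is still zero.)
```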
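Kernel methods in PCA
Linear PCA
$C w = \lambda w$    (1)
where C is the covariance matrix for centered data X:
$C = \frac{1}{l} \sum_{i=1}^{l} x_i x_i'$
$C w = \frac{1}{l} \sum_{i=1}^{l} (x_i' w)\, x_i = \lambda w$
$\Rightarrow w \in \operatorname{span}\{x_1, \ldots, x_l\}$ if $\lambda \neq 0$
$\lambda \langle x_i, w \rangle = \langle x_i, C w \rangle, \quad i = 1, \ldots, l$    (2)
(1) and (2) are equivalent conditions.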
Kernel methods in PCA
Now let us suppose:
$\Phi : \mathbb{R}^{p} \to F$, the feature space
Possibly F is a very high-dimensional space.
In Kernel PCA, we do the PCA in feature space.
$C = \frac{1}{l} \sum_{i=1}^{l} \Phi(x_i) \Phi(x_i)^T$   (what is its meaning?)
$\lambda v = C v = \frac{1}{l} \sum_{i=1}^{l} \langle \Phi(x_i), v \rangle\, \Phi(x_i)$    (*)
Remember about centering!
Kernel Methods in PCA
Again, all solutions $v$ with $\lambda \neq 0$ lie in the space generated by
$\{\Phi(x_1), \ldots, \Phi(x_l)\}$.
It has two useful consequences:
1) $v \in \operatorname{span}\{\Phi(x_1), \ldots, \Phi(x_l)\} \Rightarrow v = \sum_{i=1}^{l} \alpha_i \Phi(x_i)$
2) We may instead solve the set of equations
$\lambda \langle \Phi(x_i), v \rangle = \langle \Phi(x_i), C v \rangle, \quad i = 1, \ldots, l$
Kernel Methods in PCA
Defining an l × l kernel matrix K:
$K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
and using consequence 1) in 2), we get
$l \lambda K \alpha = K^2 \alpha$    (3)
But we need not solve (3). It can be shown easily
that the following simpler system gives us the solutions
that are of interest to us:
$l \lambda \alpha = K \alpha$    (4)
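Kernel Methods in PCA
Solve the eigenvalue problem for the kernel matrix:
$K \alpha = \lambda \alpha$
(here λ denotes an eigenvalue of K, which corresponds to lλ in (4)).
The solutions $(\lambda_k, \alpha^k)$ further need to be normalized
by imposing
$\lambda_k \langle \alpha^k, \alpha^k \rangle = 1$, since $v^k$ should have $\| v^k \| = 1$.
If x is our new observation, its image in feature space
will be $\Phi(x)$,
and the kth principal score will be
$\langle v^k, \Phi(x) \rangle = \sum_{i=1}^{l} \alpha_i^k \langle \Phi(x), \Phi(x_i) \rangle = \sum_{i=1}^{l} \alpha_i^k\, k(x, x_i)$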
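The normalization and the scoring of a new observation can be sketched as follows. The Gaussian (RBF) kernel, its parameter `gamma`, the synthetic data, and the function name `rbf_kernel` are assumptions. Centering in feature space is left out here to keep the example short.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))                   # l = 30 hypothetical training points
K = rbf_kernel(X, X)                           # l x l kernel matrix

eigvals, A = np.linalg.eigh(K)                 # K alpha = lambda alpha, ascending
eigvals, A = eigvals[::-1], A[:, ::-1]         # largest eigenvalues first

n_comp = 3
# eigh returns unit-length eigenvectors; rescale so that lambda_k <alpha^k, alpha^k> = 1.
alphas = A[:, :n_comp] / np.sqrt(eigvals[:n_comp])

x_new = np.array([[0.2, -0.1]])                # a new observation
scores = rbf_kernel(x_new, X) @ alphas         # kth entry: sum_i alpha_i^k k(x, x_i)
```

With this scaling, each implicit feature-space direction $v^k$ has unit norm, which is what `alphas.T @ K @ alphas` being the identity matrix expresses.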
Kernel Methods in PCA
Data centering:
$\hat{\Phi}(x) = \Phi(x) - \Phi_S = \Phi(x) - \frac{1}{l} \sum_{i=1}^{l} \Phi(x_i)$
Hence, the kernel for the transformed space is
$\hat{k}(x, z) = \langle \hat{\Phi}(x), \hat{\Phi}(z) \rangle = \left\langle \Phi(x) - \frac{1}{l} \sum_{i=1}^{l} \Phi(x_i),\; \Phi(z) - \frac{1}{l} \sum_{i=1}^{l} \Phi(x_i) \right\rangle$
$= k(x, z) - \frac{1}{l} \sum_{i=1}^{l} k(x, x_i) - \frac{1}{l} \sum_{i=1}^{l} k(z, x_i) + \frac{1}{l^2} \sum_{i,j=1}^{l} k(x_i, x_j)$
Kernel Methods in PCA
Expressed as an operation on the kernel matrix, this
can be rewritten as
$\hat{K} = K - \frac{1}{l}\, j j' K - \frac{1}{l}\, K j j' + \frac{1}{l^2}\, (j' K j)\, j j'$
where j is the all 1s vector.
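A minimal sketch of this centering operation. The check uses a linear kernel, where the feature map is the identity, so the result can be compared with explicitly centered data. The function name `center_kernel` and the data are assumptions.

```python
import numpy as np

def center_kernel(K):
    """Return K - (1/l) jj'K - (1/l) Kjj' + (1/l^2) (j'Kj) jj'."""
    l = K.shape[0]
    J = np.ones((l, l)) / l                    # J = (1/l) j j'
    return K - J @ K - K @ J + J @ K @ J       # J K J = (1/l^2) (j'Kj) j j'

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(20, 3))          # deliberately not centered

K = X @ X.T                                    # linear kernel on raw data
K_hat = center_kernel(K)

Xc = X - X.mean(axis=0)                        # explicit centering in input space
K_explicit = Xc @ Xc.T
```

For the linear kernel, `K_hat` and `K_explicit` agree, which confirms that the matrix formula centers the data in feature space without ever forming the feature vectors.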
[Figure: Linear PCA compared with Kernel PCA]
Kernel PCA captures the nonlinear structure of the data.
Algorithm
Input: Data X = {x1, x2, …, xl} in n-dimensional space.
Process:
$K_{ij} = k(x_i, x_j); \quad i, j = 1, \ldots, l.$
$\hat{K} = K - \frac{1}{l}\, j j' K - \frac{1}{l}\, K j j' + \frac{1}{l^2}\, (j' K j)\, j j';$   (kernel matrix for centered data)
$[V, \Lambda] = \operatorname{eig}(\hat{K});$
$\alpha^{(j)} = \frac{1}{\sqrt{\lambda_j}}\, v_j, \quad j = 1, \ldots, l.$
$\tilde{x} = \left( \sum_{i=1}^{l} \alpha_i^{(j)}\, k(x_i, x) \right)_{j=1}^{k}$
Output: Transformed data, the k-dimensional vector projection of new
data into this subspace
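The algorithm can be sketched end to end in NumPy. This is an illustrative sketch, not the author's code: the RBF kernel, `gamma`, the function names, the dictionary used as a model container, and the two-circle data set are all assumptions. The kernel values for new points are centered with the training statistics, following the centered-kernel formula.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_fit(X, n_components, gamma):
    l = X.shape[0]
    K = rbf_kernel(X, X, gamma)                       # K_ij = k(x_i, x_j)
    J = np.ones((l, l)) / l
    K_hat = K - J @ K - K @ J + J @ K @ J             # kernel matrix for centered data
    eigvals, V = np.linalg.eigh(K_hat)                # [V, Lambda] = eig(K_hat)
    eigvals = eigvals[::-1][:n_components]            # keep the largest eigenvalues
    V = V[:, ::-1][:, :n_components]
    alphas = V / np.sqrt(eigvals)                     # alpha^(j) = v_j / sqrt(lambda_j)
    return {"X": X, "K": K, "alphas": alphas, "eigvals": eigvals, "gamma": gamma}

def kpca_transform(model, X_new):
    K, alphas = model["K"], model["alphas"]
    K_new = rbf_kernel(X_new, model["X"], model["gamma"])
    # Center the new kernel values using the training-set means.
    K_new_hat = (K_new - K_new.mean(axis=1, keepdims=True)
                 - K.mean(axis=0, keepdims=True) + K.mean())
    return K_new_hat @ alphas                         # one k-dimensional row per point

# Hypothetical data: two concentric circles, a pattern linear PCA cannot unfold.
rng = np.random.default_rng(5)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
radius = np.repeat([1.0, 3.0], 50)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

model = kpca_fit(X, n_components=2, gamma=0.5)
scores = kpca_transform(model, X)                     # 100 x 2 transformed data
```

Applied to the training points themselves, the score columns are mutually orthogonal and their squared norms equal the retained eigenvalues.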
References
• I.T. Jolliffe (2002). Principal Component Analysis.
• B. Schölkopf et al. (1998). Kernel Principal Component Analysis.
• B. Schölkopf and A.J. Smola (2000/2001/2002). Learning with Kernels.
• Christopher J.C. Burges (2005). Geometric Methods for
Feature Extraction and Dimensional Reduction.