PCA and Kernel PCA
Presented by Shicai Yang
Institute of Systems Engineering, Southeast University
April 13, 2015
Outline
• PCA
• Kernel Methods
• Kernel PCA
• Others
1. PCA Overview
• Principal component analysis (PCA) is a way to reduce data dimensionality
• PCA projects high-dimensional data to a lower dimension
• PCA projects the data in the least-squares sense: it captures the large (principal) variability in the data and ignores the small variability
PCA: An Intuitive Approach
Let us say we have xi, i=1…N data points in p dimensions (p is large)
If we want to represent the data set by a single point x0, then
$$\mathbf{x}_0 = \mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i \qquad \text{(the sample mean)}$$
Can we justify this choice mathematically?
$$J_0(\mathbf{x}_0) = \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{x}_0\|^2$$
It turns out that minimizing J0 gives exactly this solution, namely the sample mean
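As a quick check, setting the gradient of J0 with respect to x0 to zero gives the sample mean:
$$\nabla_{\mathbf{x}_0} J_0 = -2\sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{x}_0) = 0 \;\Rightarrow\; \mathbf{x}_0 = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \mathbf{m}$$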
PCA: An Intuitive Approach…
Representing the data set xi, i=1…N by its mean is quite uninformative
So let’s try to represent the data by a straight line of the form:
$$\mathbf{x} = \mathbf{m} + a\,\mathbf{e}$$
This is the equation of a straight line passing through m
e is a unit vector along the straight line
a is the signed distance of a point x from m
The training points projected on this straight line would be
$$\mathbf{x}_i = \mathbf{m} + a_i\,\mathbf{e}, \qquad i = 1, \dots, N$$
PCA: An Intuitive Approach…
Let’s now determine ai’s
$$J_1(a_1, a_2, \dots, a_N, \mathbf{e}) = \sum_{i=1}^{N}\|(\mathbf{m} + a_i\mathbf{e}) - \mathbf{x}_i\|^2$$
$$= \sum_{i=1}^{N} a_i^2\,\|\mathbf{e}\|^2 - 2\sum_{i=1}^{N} a_i\,\mathbf{e}^T(\mathbf{x}_i - \mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2$$
$$= \sum_{i=1}^{N} a_i^2 - 2\sum_{i=1}^{N} a_i\,\mathbf{e}^T(\mathbf{x}_i - \mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2$$
Partially differentiating with respect to $a_i$ and setting the derivative to zero, we get: $a_i = \mathbf{e}^T(\mathbf{x}_i - \mathbf{m})$
Plugging in this expression for ai in J1 we get:
$$J_1(\mathbf{e}) = -\sum_{i=1}^{N}\mathbf{e}^T(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2 = -\mathbf{e}^T S\,\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2$$
where $S = \sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$ is called the scatter matrix
PCA: An Intuitive Approach…
So minimizing J1 is equivalent to maximizing:
$$\mathbf{e}^T S\,\mathbf{e}$$
Subject to the constraint that e is a unit vector: $\mathbf{e}^T\mathbf{e} = 1$
Use Lagrange multiplier method to form the objective function:
$$\mathbf{e}^T S\,\mathbf{e} - \lambda(\mathbf{e}^T\mathbf{e} - 1)$$
Differentiate to obtain the equation: $2S\mathbf{e} - 2\lambda\mathbf{e} = 0$, i.e. $S\mathbf{e} = \lambda\mathbf{e}$
The solution is that e is the eigenvector of S corresponding to the largest eigenvalue
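Substituting this eigenvector back into the objective shows why the largest eigenvalue is the right choice:
$$\mathbf{e}^T S\,\mathbf{e} = \lambda\,\mathbf{e}^T\mathbf{e} = \lambda,$$
so the variance captured along e equals the eigenvalue, which is maximal for the top eigenvector.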
PCA: An Intuitive Approach…
The preceding analysis can be extended in the following way.
Instead of projecting the data points onto a straight line, we may
now want to project them onto a d-dimensional plane of the form:
$$\mathbf{x} = \mathbf{m} + a_1\mathbf{e}_1 + \dots + a_d\mathbf{e}_d$$
d is much smaller than the original dimension p
In this case one can form the objective function:
$$J_d = \sum_{i=1}^{N}\Big\|\Big(\mathbf{m} + \sum_{k=1}^{d} a_{ik}\,\mathbf{e}_k\Big) - \mathbf{x}_i\Big\|^2$$
It can also be shown that the vectors e1, e2, …, ed are the d eigenvectors
corresponding to the d largest eigenvalues of the scatter matrix $S = \sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$
PCA: Visually
Data points are represented in a rotated orthogonal coordinate system: the origin
is the mean of the data points and the axes are provided by the eigenvectors.
PCA Steps
• Let x = (x1, x2, ⋯, xn)T be an n-dimensional random vector
⑴ Arrange the raw observations into an observation matrix X, with one observation sample per column and one dimension per row
⑵ Compute the covariance matrix of the samples X: covX = COV(X)
⑶ Compute the eigenvalues and eigenvectors of covX, and sort the eigenvalues in descending order
⑷ Form the matrix V from the eigenvectors corresponding to the m largest eigenvalues
⑸ Y = VTX; Y is then the dimension-reduced matrix
MATLAB Functions and Algorithms for PCA
1. princomp: principal component analysis
• PC=princomp(X)
• [PC,score,latent,tsquare]=princomp(X)
– Performs principal component analysis on the data matrix X (N*p, rows = observation samples, columns = feature variables), returning the principal components (PC), the so-called Z-scores (score), the eigenvalues of the covariance matrix of X (latent), and Hotelling's T2 statistic for each data point (tsquare).
2. pcacov: principal component analysis from a covariance matrix
• PC=pcacov(X)
• [PC,latent,explained]=pcacov(X)
– Performs principal component analysis on the covariance matrix X, returning the principal components (PC), the eigenvalues of the covariance matrix X (latent), and the percentage of the total observed variance explained by each component (explained).
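An illustrative call, assuming X is an N*p data matrix and m is the number of components to keep:
[PC, score, latent] = princomp(X);            % principal components, scores, eigenvalues
m = 2;                                        % number of components to retain (adjust as needed)
Xreduced = score(:, 1:m);                     % data expressed in the first m principal components
explained = cumsum(latent) / sum(latent);     % cumulative fraction of variance explained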
3. pcares: residuals from principal component analysis
• residuals=pcares(X,ndim)
– Returns the residuals obtained when ndim principal components of X are retained. Note that ndim is a scalar and must be smaller than the number of columns of X; moreover, X is a data matrix, not a covariance matrix.
4. barttest: Bartlett's test for principal components
• ndim=barttest(X,alpha)
• [ndim,prob,chisquare]=barttest(X,alpha)
– Bartlett's test is a test of equal variances. ndim=barttest(X,alpha) gives, at significance level alpha, the dimension ndim of a non-random model that fits the data matrix X; ndim is determined by a sequence of hypothesis tests. ndim=1 means the variances of X along every principal component are equal; ndim=2 means the variances along the second and all remaining components are equal.
Computing the Covariance Matrix
(1) XCOV=COV(X)
(2) % rows = observation samples, columns = feature variables; cv is the returned covariance matrix
xmean=mean(x); xsize=size(x);
for i=1:xsize(2)
    xx1=x(:,i);
    mxx1=xmean(i);
    for j=i:xsize(2)                               % the covariance matrix is symmetric
        xx2=x(:,j);
        mxx2=xmean(j);
        v=((xx1-mxx1)'*(xx2-mxx2))/(xsize(1)-1);   % divide by N-1, the number of observations
        cv(i,j)=v;
        cv(j,i)=v;
    end
end
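As a quick sanity check, the loop above should agree with the built-in function:
max(max(abs(cv - cov(x))))    % should be numerically close to zero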
MATLAB Implementation of PCA
function [xeigvsort,xeigdsort,final]=KL_Exp(x)
xmean=mean(x); xsize=size(x);
for i=1:xsize(2)
    xadjust(:,i)=x(:,i)-xmean(i);        % subtract the mean of each variable
end
xcov=cov(xadjust);                       % compute the covariance matrix
[xeigv,xeigd]=eig(xcov);                 % eigenvalues and eigenvectors (ascending order)
xeigvsort=fliplr(xeigv);                 % eigenvectors sorted by descending eigenvalue
xeigdsort=flipud(fliplr(xeigd));         % eigenvalues sorted in descending order
finaleigs=xeigvsort(:,1:xsize(2));       % select the transformation basis; xsize(2) can be reduced to keep fewer components
pdata=finaleigs'*xadjust';               % apply the transformation
final=pdata';
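An illustrative call, assuming X is an N*p data matrix with one observation sample per row:
[V, D, Y] = KL_Exp(X);    % V: sorted eigenvectors, D: sorted eigenvalues, Y: transformed data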
Assumptions and Limitations
• Linearity assumption
– The internal model of PCA is linear, which means the relationships between the extracted principal components are also linear. The now-popular Kernel-PCA family of methods is a nonlinear extension of the original PCA method.
• Mean and variance as sufficient statistics
– Models in which the mean and variance fully describe the probability distribution are limited to exponential-family distributions. If the data distribution is non-Gaussian, PCA may fail, and ICA methods come into play.
• Directions of large variance are assumed to be important
– PCA implicitly assumes that the data have a high signal-to-noise ratio, so the direction with the highest variance can be regarded as the principal component, while small-variance variation is regarded as noise. This amounts to the choice of a low-pass filter.
• Orthogonal principal components
– PCA assumes that the principal component vectors are mutually orthogonal, so the problem can be solved with a set of efficient linear-algebra tools, which greatly improves efficiency and widens the range of applications.
2. Kernel Methods
• Find a mapping f such that, in the new space, problem
solving is easier (e.g. linear)
• The kernel represents the similarity between two objects,
defined as the dot-product in this new vector space
• But the mapping is left implicit
• Easy generalization of a lot of dot-product (or distance)
based pattern recognition algorithms
Kernel Methods : the mapping
[Figure: points in the Original Space are mapped by f into the Feature (Vector) Space]
Feature Spaces
$$\Phi : \mathbf{x} \mapsto \Phi(\mathbf{x}), \qquad \mathbb{R}^d \to F$$
Non-linear Mapping to F
1. High-d Space
2. Infinite-d Countable Space: L2
3. Function Space (Hilbert Space)
Example:
$$\Phi(x, y) = \big(x^2,\; y^2,\; \sqrt{2}\,xy\big)$$
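For this example mapping, the dot product in F can be evaluated directly from the original coordinates, which is exactly what the kernel will exploit:
$$\Phi(x_1, y_1)\cdot\Phi(x_2, y_2) = x_1^2 x_2^2 + y_1^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = (x_1 x_2 + y_1 y_2)^2 = \langle(x_1, y_1), (x_2, y_2)\rangle^2$$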
Kernel : more formal definition
• A kernel k(x,y)
– is a similarity measure
– defined by an implicit mapping f,
– from the original space to a vector space (feature space)
– such that: k(x,y)=f(x)•f(y)
• This similarity measure and the mapping include:
– Invariance or other a priori knowledge
– Simpler structure (linear representation of the data)
– The class of functions the solution is taken from
– Possibly infinite dimension (hypothesis space for learning)
– … but still computational efficiency when computing k(x,y)
General Principles governing Kernel Design
Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.
Kernel Trick:
Replace: $\mathbf{x} \to \Phi(\mathbf{x})$, and correspondingly
$$G_{ij} = \langle\mathbf{x}_i, \mathbf{x}_j\rangle \;\longrightarrow\; G_{ij} = \langle\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j) \qquad \text{(kernel)}$$
If we use algorithms that only depend on the Gram-matrix, G,
then we never have to know (compute) the actual features Φ
This is the crucial point of kernel methods
Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel-algorithm.
Some Kernels:
$$k(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2/c}$$
$$k(\mathbf{x},\mathbf{y}) = (\langle\mathbf{x},\mathbf{y}\rangle + \theta)^d$$
$$k(\mathbf{x},\mathbf{y}) = \tanh(\kappa\langle\mathbf{x},\mathbf{y}\rangle + \theta)$$
$$k(\mathbf{x},\mathbf{y}) = \frac{1}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2 + c^2}}$$
Some Kernel Algorithms:
- SVM
- Fisher LDA (KFDA)
- Kernel Regression
- Kernel PCA
- Kernel CCA
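For concreteness, these kernels can be written as MATLAB anonymous functions (a sketch; x and y are column vectors, and c, theta, kappa, d are parameters you choose):
k_rbf  = @(x, y, c)            exp(-norm(x - y)^2 / c);        % Gaussian (RBF) kernel
k_poly = @(x, y, theta, d)     (x'*y + theta)^d;               % polynomial kernel
k_sig  = @(x, y, kappa, theta) tanh(kappa*(x'*y) + theta);     % sigmoid kernel
k_imq  = @(x, y, c)            1 / sqrt(norm(x - y)^2 + c^2);  % inverse multiquadric kernel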
Benefits from kernels
• Generalizes (nonlinearly) pattern recognition algorithms
in clustering, classification, density estimation, …
– When these algorithms are dot-product based, by replacing the
dot product (x•y) by k(x,y)=f(x)•f(y)
e.g.: linear discriminant analysis, logistic regression, perceptron,
SOM, PCA, ICA, …
NB: this often implies working with the "dual" form of the algorithm.
– When these algorithms are distance-based, by replacing the squared distance d(x,y)² by k(x,x)+k(y,y)-2k(x,y)
• Freedom of choosing f implies a large variety of learning
algorithms
3. Kernel PCA
• The assumption behind PCA is that the data points x are multivariate Gaussian
• Often this assumption does not hold
• However, it may still be the case that a transformation f(x) of the data is Gaussian; then we can perform PCA in the space of f(x)
• Kernel PCA performs exactly this PCA; however, because of the "kernel trick," it never computes the mapping f(x) explicitly!
KPCA: Basic Idea
Kernel PCA Formulation
• We need the following fact:
• Let v be an eigenvector of the scatter matrix: $S = \sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T$
• Then v belongs to the linear space spanned by the data points xi, i = 1, 2, …, N.
• Proof:
$$S\mathbf{v} = \lambda\mathbf{v} \;\Rightarrow\; \mathbf{v} = \frac{1}{\lambda}\sum_{i=1}^{N}\mathbf{x}_i\big(\mathbf{x}_i^T\mathbf{v}\big) = \sum_{i=1}^{N}\alpha_i\,\mathbf{x}_i$$
Kernel PCA Formulation…
• Let C be the scatter matrix of the centered mapping f(x):
$$C = \sum_{i=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T$$
• Let w be an eigenvector of C, then w can be written as a
linear combination:
$$\mathbf{w} = \sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)$$
• Also, we have: $C\mathbf{w} = \lambda\mathbf{w}$
• Combining, we get:
$$\Big(\sum_{i=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T\Big)\Big(\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)\Big) = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)$$
Kernel PCA Formulation…
$$\Big(\sum_{i=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T\Big)\Big(\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)\Big) = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k) \;\Rightarrow$$
$$\sum_{i=1}^{N}\sum_{k=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T f(\mathbf{x}_k)\,\alpha_k = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)$$
Multiplying both sides on the left by $f(\mathbf{x}_l)^T$:
$$\sum_{i=1}^{N}\sum_{k=1}^{N} f(\mathbf{x}_l)^T f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T f(\mathbf{x}_k)\,\alpha_k = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_l)^T f(\mathbf{x}_k), \qquad l = 1, 2, \dots, N \;\Rightarrow$$
$$K^2\boldsymbol{\alpha} = \lambda K\boldsymbol{\alpha} \;\Rightarrow\; K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha}, \quad \text{where } K_{ij} = f(\mathbf{x}_i)^T f(\mathbf{x}_j)$$
(the kernel or Gram matrix)
Kernel PCA Formulation…
From the eigen equation $K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha}$
and the fact that the eigenvector w is normalized to 1, we obtain:
$$\|\mathbf{w}\|^2 = \Big(\sum_{i=1}^{N}\alpha_i f(\mathbf{x}_i)\Big)^T\Big(\sum_{i=1}^{N}\alpha_i f(\mathbf{x}_i)\Big) = \boldsymbol{\alpha}^T K\boldsymbol{\alpha} = 1 \;\Rightarrow\; \boldsymbol{\alpha}^T\boldsymbol{\alpha} = \frac{1}{\lambda}$$
KPCA Algorithm
Step 1: Compute the Gram matrix: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \; i, j = 1, \dots, N$
Step 2: Compute the (eigenvalue, eigenvector) pairs of K: $(\boldsymbol{\alpha}^l, \lambda_l), \; l = 1, \dots, M$
Step 3: Normalize the eigenvectors so that $\lambda_l\,(\boldsymbol{\alpha}^l)^T\boldsymbol{\alpha}^l = 1$: $\;\boldsymbol{\alpha}^l \leftarrow \boldsymbol{\alpha}^l / \sqrt{\lambda_l}$
Thus, an eigenvector $\mathbf{w}^l$ of C is now represented as:
$$\mathbf{w}^l = \sum_{k=1}^{N}\alpha_k^l\, f(\mathbf{x}_k)$$
To project a test feature f(x) onto wl we need to compute:
$$f(\mathbf{x})^T\mathbf{w}^l = f(\mathbf{x})^T\Big(\sum_{k=1}^{N}\alpha_k^l\, f(\mathbf{x}_k)\Big) = \sum_{k=1}^{N}\alpha_k^l\, k(\mathbf{x}_k, \mathbf{x})$$
So, we never need f explicitly
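A minimal MATLAB sketch of these steps, assuming the data are stored in an N*p matrix X (one sample per row) and using the Gaussian kernel from the kernel list above; the width c and the number of components d are illustrative choices. The Gram matrix is centered first, matching the assumption that the mapped points f(x) are centered in feature space:
c = 1;                                   % kernel width (to be tuned)
N = size(X, 1);
sq = sum(X.^2, 2);
D2 = bsxfun(@plus, sq, sq') - 2*(X*X');  % pairwise squared distances
K = exp(-D2 / c);                        % Step 1: Gram matrix with the Gaussian kernel
J = ones(N) / N;
Kc = K - J*K - K*J + J*K*J;              % center the Gram matrix in feature space
[A, L] = eig(Kc);                        % Step 2: eigen-decomposition of K
[lambda, idx] = sort(diag(L), 'descend');
A = A(:, idx);
for l = 1:N                              % Step 3: normalize so that lambda_l*(alpha^l'*alpha^l) = 1
    if lambda(l) > eps
        A(:, l) = A(:, l) / sqrt(lambda(l));
    end
end
d = 2;                                   % number of kernel principal components to keep
Y = Kc * A(:, 1:d);                      % projections of the training points onto w^1, ..., w^d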
Examples of Kernels
[Figures: the feature mapping f illustrated for a polynomial kernel (n=2) and an RBF kernel (n=2)]
4. Others
• 2DPCA
– Prof. Yang Jingyu et al., Nanjing University of Science and Technology, IEEE T-PAMI, 2004(1)
– 2DPCA extracts features at least as well as PCA, but requires more memory than PCA.
• 2DLDA
– Prof. Yuan Baozong et al., Beijing Jiaotong University, P. R. Letters, 2005(3)
• Kernel ECA (KECA)
– Robert Jenssen et al., IEEE T-PAMI, 2010(5)
– Maximum entropy preservation: the transformation loses as little entropy as possible, cleverly combining entropy with the kernel-induced data mapping; the entropy computation reduces naturally to a computation on the kernel matrix, so the problem becomes an optimization in kernel space.
References
[1] J. T. Y. Kwok and I. W. H. Tsang, "The Pre-Image Problem in Kernel Methods,"
IEEE Transactions on Neural Networks, vol. 15, pp. 1517-1525, 2004.
[2] S. Mika, et al., "Kernel PCA and De-Noising in Feature Spaces," in Proceedings of
the 1998 conference on Advances in Neural Information Processing Systems II,
1999.
[3] B. Schölkopf, et al., "Nonlinear Component Analysis as a Kernel Eigenvalue
Problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[4] R. Jenssen, "Information Theoretic Learning and Kernel Methods," in Information
Theory and Statistical Learning, Springer US, 2009, pp. 209-230.
[5] R. Jenssen, et al., "Kernel Maximum Entropy Data Transformation and an Enhanced
Spectral Clustering Algorithm," in Proceedings of the 2006 conference on Advances
in Neural Information Processing Systems 19, 2007, pp. 633-640.
[6] R. Jenssen and O. Storås, "Kernel Entropy Component Analysis Pre-images for
Pattern Denoising," in Proceedings of the 16th Scandinavian Conference on Image
Analysis, Oslo, Norway, 2009, pp. 626-635.
[7] R. Jenssen, "Kernel Entropy Component Analysis," IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 32, pp. 847-860, 2010.