PCA and Kernel PCA
Presented by Shicai Yang
Institute of Systems Engineering, Southeast University
April 13, 2015
Outline
• PCA
• Kernel Methods
• Kernel PCA
• Others
1. PCA Overview
• Principal component analysis (PCA) is a way to reduce data dimensionality
• PCA projects high-dimensional data to a lower dimension
• PCA projects the data in the least-squares sense: it captures the large (principal) variability in the data and ignores the small variability
PCA: An Intuitive Approach
Let us say we have xi, i=1…N data points in p dimensions (p is large)
If we want to represent the data set by a single point x0, then
$$\mathbf{x}_0 = \mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i \qquad \text{(the sample mean)}$$
Can we justify this choice mathematically?
$$J_0(\mathbf{x}_0) = \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{x}_0\|^2$$
It turns out that minimizing J0 gives exactly this solution, namely the sample mean
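As a quick check, setting the gradient of J0 with respect to x0 to zero gives the sample mean:
$$\nabla_{\mathbf{x}_0} J_0 = -2\sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{x}_0) = 0 \;\Rightarrow\; \mathbf{x}_0 = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \mathbf{m}$$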
PCA: An Intuitive Approach…
Representing the data set xi, i=1…N by its mean is quite uninformative
So let’s try to represent the data by a straight line of the form:
$$\mathbf{x} = \mathbf{m} + a\,\mathbf{e}$$
This is the equation of a straight line passing through m
e is a unit vector along the straight line
a is the signed distance of a point x from m
The training points projected on this straight line would be
$$\mathbf{x}_i = \mathbf{m} + a_i\,\mathbf{e}, \qquad i = 1, \dots, N$$
PCA: An Intuitive Approach…
Let’s now determine ai’s
$$J_1(a_1, a_2, \dots, a_N, \mathbf{e}) = \sum_{i=1}^{N}\|(\mathbf{m} + a_i\mathbf{e}) - \mathbf{x}_i\|^2$$
$$= \sum_{i=1}^{N} a_i^2\,\|\mathbf{e}\|^2 - 2\sum_{i=1}^{N} a_i\,\mathbf{e}^T(\mathbf{x}_i - \mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2$$
$$= \sum_{i=1}^{N} a_i^2 - 2\sum_{i=1}^{N} a_i\,\mathbf{e}^T(\mathbf{x}_i - \mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2$$
Partially differentiating with respect to $a_i$ and setting the derivative to zero, we get: $a_i = \mathbf{e}^T(\mathbf{x}_i - \mathbf{m})$
Plugging in this expression for ai in J1 we get:
$$J_1(\mathbf{e}) = -\sum_{i=1}^{N}\mathbf{e}^T(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2 = -\mathbf{e}^T S\,\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i - \mathbf{m}\|^2$$
where $S = \sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$ is called the scatter matrix
PCA: An Intuitive Approach…
So minimizing J1 is equivalent to maximizing:
$$\mathbf{e}^T S\,\mathbf{e}$$
Subject to the constraint that e is a unit vector: $\mathbf{e}^T\mathbf{e} = 1$
Use Lagrange multiplier method to form the objective function:
$$\mathbf{e}^T S\,\mathbf{e} - \lambda(\mathbf{e}^T\mathbf{e} - 1)$$
Differentiate to obtain the equation: $2S\mathbf{e} - 2\lambda\mathbf{e} = 0$, i.e. $S\mathbf{e} = \lambda\mathbf{e}$
The solution is that e is the eigenvector of S corresponding to the largest eigenvalue
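Substituting this eigenvector back into the objective shows why the largest eigenvalue is the right choice:
$$\mathbf{e}^T S\,\mathbf{e} = \lambda\,\mathbf{e}^T\mathbf{e} = \lambda,$$
so the variance captured along e equals the eigenvalue, which is maximal for the top eigenvector.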
PCA: An Intuitive Approach…
The preceding analysis can be extended in the following way.
Instead of projecting the data points onto a straight line, we may
now want to project them onto a d-dimensional plane of the form:
$$\mathbf{x} = \mathbf{m} + a_1\mathbf{e}_1 + \dots + a_d\mathbf{e}_d$$
d is much smaller than the original dimension p
In this case one can form the objective function:
$$J_d = \sum_{i=1}^{N}\Big\|\Big(\mathbf{m} + \sum_{k=1}^{d} a_{ik}\,\mathbf{e}_k\Big) - \mathbf{x}_i\Big\|^2$$
It can also be shown that the vectors e1, e2, …, ed are the d eigenvectors
corresponding to the d largest eigenvalues of the scatter matrix $S = \sum_{i=1}^{N}(\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$
PCA: Visually
Data points are represented in a rotated orthogonal coordinate system: the origin
is the mean of the data points and the axes are provided by the eigenvectors.
PCA Steps
• Let x = (x1, x2, ⋯, xn)T be an n-dimensional random vector
⑴ Arrange the raw observations into an observation matrix X, with one observation sample per column and one dimension per row
⑵ Compute the covariance matrix of the samples X: covX = COV(X)
⑶ Compute the eigenvalues and eigenvectors of covX, and sort the eigenvalues in descending order
⑷ Form the matrix V from the eigenvectors corresponding to the m largest eigenvalues
⑸ Y = VTX; Y is then the dimension-reduced matrix
MATLAB Functions and Algorithms for PCA
1. princomp: principal component analysis
• PC=princomp(X)
• [PC,score,latent,tsquare]=princomp(X)
– Performs principal component analysis on the data matrix X (N*p, rows = observation samples, columns = feature variables), returning the principal components (PC), the so-called Z-scores (score), the eigenvalues of the covariance matrix of X (latent), and Hotelling's T2 statistic for each data point (tsquare).
2. pcacov: principal component analysis from a covariance matrix
• PC=pcacov(X)
• [PC,latent,explained]=pcacov(X)
– Performs principal component analysis on the covariance matrix X, returning the principal components (PC), the eigenvalues of the covariance matrix X (latent), and the percentage of the total observed variance explained by each component (explained).
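An illustrative call, assuming X is an N*p data matrix and m is the number of components to keep:
[PC, score, latent] = princomp(X);            % principal components, scores, eigenvalues
m = 2;                                        % number of components to retain (adjust as needed)
Xreduced = score(:, 1:m);                     % data expressed in the first m principal components
explained = cumsum(latent) / sum(latent);     % cumulative fraction of variance explained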
3. pcares: residuals from principal component analysis
• residuals=pcares(X,ndim)
– Returns the residuals obtained when ndim principal components of X are retained. Note that ndim is a scalar and must be smaller than the number of columns of X; moreover, X is a data matrix, not a covariance matrix.
4. barttest: Bartlett's test for principal components
• ndim=barttest(X,alpha)
• [ndim,prob,chisquare]=barttest(X,alpha)
– Bartlett's test is a test of equal variances. ndim=barttest(X,alpha) gives, at significance level alpha, the dimension ndim of a non-random model that fits the data matrix X; ndim is determined by a sequence of hypothesis tests. ndim=1 means the variances of X along every principal component are equal; ndim=2 means the variances along the second and all remaining components are equal.
Computing the Covariance Matrix
(1) XCOV=COV(X)
(2) % rows = observation samples, columns = feature variables; cv is the returned covariance matrix
xmean=mean(x); xsize=size(x);
for i=1:xsize(2)
    xx1=x(:,i);
    mxx1=xmean(i);
    for j=i:xsize(2)                               % the covariance matrix is symmetric
        xx2=x(:,j);
        mxx2=xmean(j);
        v=((xx1-mxx1)'*(xx2-mxx2))/(xsize(1)-1);   % divide by N-1, the number of observations
        cv(i,j)=v;
        cv(j,i)=v;
    end
end
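As a quick sanity check, the loop above should agree with the built-in function:
max(max(abs(cv - cov(x))))    % should be numerically close to zero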
MATLAB Implementation of PCA
function [xeigvsort,xeigdsort,final]=KL_Exp(x)
xmean=mean(x); xsize=size(x);
for i=1:xsize(2)
    xadjust(:,i)=x(:,i)-xmean(i);        % subtract the mean of each variable
end
xcov=cov(xadjust);                       % compute the covariance matrix
[xeigv,xeigd]=eig(xcov);                 % eigenvalues and eigenvectors (ascending order)
xeigvsort=fliplr(xeigv);                 % eigenvectors sorted by descending eigenvalue
xeigdsort=flipud(fliplr(xeigd));         % eigenvalues sorted in descending order
finaleigs=xeigvsort(:,1:xsize(2));       % select the transformation basis; xsize(2) can be reduced to keep fewer components
pdata=finaleigs'*xadjust';               % apply the transformation
final=pdata';
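An illustrative call, assuming X is an N*p data matrix with one observation sample per row:
[V, D, Y] = KL_Exp(X);    % V: sorted eigenvectors, D: sorted eigenvalues, Y: transformed data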
Assumptions and Limitations
• Linearity assumption
– The internal model of PCA is linear, which means the relationships between the extracted principal components are also linear. The now-popular Kernel-PCA family of methods is a nonlinear extension of the original PCA method.
• Mean and variance as sufficient statistics
– Models in which the mean and variance fully describe the probability distribution are limited to exponential-family distributions. If the data distribution is non-Gaussian, PCA may fail, and ICA methods come into play.
• Directions of large variance are assumed to be important
– PCA implicitly assumes that the data have a high signal-to-noise ratio, so the direction with the highest variance can be regarded as the principal component, while small-variance variation is regarded as noise. This amounts to the choice of a low-pass filter.
• Orthogonal principal components
– PCA assumes that the principal component vectors are mutually orthogonal, so the problem can be solved with a set of efficient linear-algebra tools, which greatly improves efficiency and widens the range of applications.
2. Kernel Methods
• Find a mapping f such that, in the new space, problem
solving is easier (e.g. linear)
• The kernel represents the similarity between two objects,
defined as the dot-product in this new vector space
• But the mapping is left implicit
• Easy generalization of a lot of dot-product (or distance)
based pattern recognition algorithms
Kernel Methods : the mapping
[Figure: points in the Original Space are mapped by f into the Feature (Vector) Space]
Feature Spaces
$$\Phi : \mathbf{x} \mapsto \Phi(\mathbf{x}), \qquad \mathbb{R}^d \to F$$
Non-linear Mapping to F
1. High-d Space
2. Infinite-d Countable Space: L2
3. Function Space (Hilbert Space)
Example:
$$\Phi(x, y) = \big(x^2,\; y^2,\; \sqrt{2}\,xy\big)$$
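For this example mapping, the dot product in F can be evaluated directly from the original coordinates, which is exactly what the kernel will exploit:
$$\Phi(x_1, y_1)\cdot\Phi(x_2, y_2) = x_1^2 x_2^2 + y_1^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = (x_1 x_2 + y_1 y_2)^2 = \langle(x_1, y_1), (x_2, y_2)\rangle^2$$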
Kernel : more formal definition
• A kernel k(x,y)
– is a similarity measure
– defined by an implicit mapping f,
– from the original space to a vector space (feature space)
– such that: k(x,y)=f(x)•f(y)
• This similarity measure and the mapping include:
– Invariance or other a priori knowledge
– Simpler structure (linear representation of the data)
– The class of functions the solution is taken from
– Possibly infinite dimension (hypothesis space for learning)
– … but still computational efficiency when computing k(x,y)
General Principles governing Kernel Design
Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.
Kernel Trick:
Replace: $\mathbf{x} \to \Phi(\mathbf{x})$, and correspondingly
$$G_{ij} = \langle\mathbf{x}_i, \mathbf{x}_j\rangle \;\longrightarrow\; G_{ij} = \langle\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j) \qquad \text{(kernel)}$$
If we use algorithms that only depend on the Gram-matrix, G,
then we never have to know (compute) the actual features Φ
This is the crucial point of kernel methods
Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel-algorithm.
Some Kernels:
$$k(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2/c}$$
$$k(\mathbf{x},\mathbf{y}) = (\langle\mathbf{x},\mathbf{y}\rangle + \theta)^d$$
$$k(\mathbf{x},\mathbf{y}) = \tanh(\kappa\langle\mathbf{x},\mathbf{y}\rangle + \theta)$$
$$k(\mathbf{x},\mathbf{y}) = \frac{1}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2 + c^2}}$$
Some Kernel Algorithms:
- SVM
- Fisher LDA (KFDA)
- Kernel Regression
- Kernel PCA
- Kernel CCA
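For concreteness, these kernels can be written as MATLAB anonymous functions (a sketch; x and y are column vectors, and c, theta, kappa, d are parameters you choose):
k_rbf  = @(x, y, c)            exp(-norm(x - y)^2 / c);        % Gaussian (RBF) kernel
k_poly = @(x, y, theta, d)     (x'*y + theta)^d;               % polynomial kernel
k_sig  = @(x, y, kappa, theta) tanh(kappa*(x'*y) + theta);     % sigmoid kernel
k_imq  = @(x, y, c)            1 / sqrt(norm(x - y)^2 + c^2);  % inverse multiquadric kernel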
Benefits from kernels
• Generalizes (nonlinearly) pattern recognition algorithms
in clustering, classification, density estimation, …
– When these algorithms are dot-product based, by replacing the
dot product (x•y) by k(x,y)=f(x)•f(y)
e.g.: linear discriminant analysis, logistic regression, perceptron,
SOM, PCA, ICA, …
NB: this often implies working with the "dual" form of the algorithm.
– When these algorithms are distance-based, by replacing the squared distance d(x,y)² by k(x,x)+k(y,y)-2k(x,y)
• Freedom of choosing f implies a large variety of learning
algorithms
3. Kernel PCA
• The assumption behind PCA is that the data points x are multivariate Gaussian
• Often this assumption does not hold
• However, it may still be the case that a transformation f(x) of the data is Gaussian; then we can perform PCA in the space of f(x)
• Kernel PCA performs exactly this PCA; however, because of the "kernel trick," it never computes the mapping f(x) explicitly!
KPCA: Basic Idea
Kernel PCA Formulation
• We need the following fact:
• Let v be an eigenvector of the scatter matrix: $S = \sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T$
• Then v belongs to the linear space spanned by the data points xi, i = 1, 2, …, N.
• Proof:
$$S\mathbf{v} = \lambda\mathbf{v} \;\Rightarrow\; \mathbf{v} = \frac{1}{\lambda}\sum_{i=1}^{N}\mathbf{x}_i\big(\mathbf{x}_i^T\mathbf{v}\big) = \sum_{i=1}^{N}\alpha_i\,\mathbf{x}_i$$
Kernel PCA Formulation…
• Let C be the scatter matrix of the centered mapping f(x):
$$C = \sum_{i=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T$$
• Let w be an eigenvector of C, then w can be written as a
linear combination:
$$\mathbf{w} = \sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)$$
• Also, we have: $C\mathbf{w} = \lambda\mathbf{w}$
• Combining, we get:
$$\Big(\sum_{i=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T\Big)\Big(\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)\Big) = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)$$
Kernel PCA Formulation…
$$\Big(\sum_{i=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T\Big)\Big(\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)\Big) = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k) \;\Rightarrow$$
$$\sum_{i=1}^{N}\sum_{k=1}^{N} f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T f(\mathbf{x}_k)\,\alpha_k = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_k)$$
Multiplying both sides on the left by $f(\mathbf{x}_l)^T$:
$$\sum_{i=1}^{N}\sum_{k=1}^{N} f(\mathbf{x}_l)^T f(\mathbf{x}_i)\, f(\mathbf{x}_i)^T f(\mathbf{x}_k)\,\alpha_k = \lambda\sum_{k=1}^{N}\alpha_k\, f(\mathbf{x}_l)^T f(\mathbf{x}_k), \qquad l = 1, 2, \dots, N \;\Rightarrow$$
$$K^2\boldsymbol{\alpha} = \lambda K\boldsymbol{\alpha} \;\Rightarrow\; K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha}, \quad \text{where } K_{ij} = f(\mathbf{x}_i)^T f(\mathbf{x}_j)$$
(the kernel or Gram matrix)
Kernel PCA Formulation…
From the eigen equation $K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha}$
and the fact that the eigenvector w is normalized to 1, we obtain:
$$\|\mathbf{w}\|^2 = \Big(\sum_{i=1}^{N}\alpha_i f(\mathbf{x}_i)\Big)^T\Big(\sum_{i=1}^{N}\alpha_i f(\mathbf{x}_i)\Big) = \boldsymbol{\alpha}^T K\boldsymbol{\alpha} = 1 \;\Rightarrow\; \boldsymbol{\alpha}^T\boldsymbol{\alpha} = \frac{1}{\lambda}$$
KPCA Algorithm
Step 1: Compute the Gram matrix: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \; i, j = 1, \dots, N$
Step 2: Compute the (eigenvalue, eigenvector) pairs of K: $(\boldsymbol{\alpha}^l, \lambda_l), \; l = 1, \dots, M$
Step 3: Normalize the eigenvectors so that $\lambda_l\,(\boldsymbol{\alpha}^l)^T\boldsymbol{\alpha}^l = 1$: $\;\boldsymbol{\alpha}^l \leftarrow \boldsymbol{\alpha}^l / \sqrt{\lambda_l}$
Thus, an eigenvector $\mathbf{w}^l$ of C is now represented as:
$$\mathbf{w}^l = \sum_{k=1}^{N}\alpha_k^l\, f(\mathbf{x}_k)$$
To project a test feature f(x) onto wl we need to compute:
$$f(\mathbf{x})^T\mathbf{w}^l = f(\mathbf{x})^T\Big(\sum_{k=1}^{N}\alpha_k^l\, f(\mathbf{x}_k)\Big) = \sum_{k=1}^{N}\alpha_k^l\, k(\mathbf{x}_k, \mathbf{x})$$
So, we never need f explicitly
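A minimal MATLAB sketch of these steps, assuming the data are stored in an N*p matrix X (one sample per row) and using the Gaussian kernel from the kernel list above; the width c and the number of components d are illustrative choices. The Gram matrix is centered first, matching the assumption that the mapped points f(x) are centered in feature space:
c = 1;                                   % kernel width (to be tuned)
N = size(X, 1);
sq = sum(X.^2, 2);
D2 = bsxfun(@plus, sq, sq') - 2*(X*X');  % pairwise squared distances
K = exp(-D2 / c);                        % Step 1: Gram matrix with the Gaussian kernel
J = ones(N) / N;
Kc = K - J*K - K*J + J*K*J;              % center the Gram matrix in feature space
[A, L] = eig(Kc);                        % Step 2: eigen-decomposition of K
[lambda, idx] = sort(diag(L), 'descend');
A = A(:, idx);
for l = 1:N                              % Step 3: normalize so that lambda_l*(alpha^l'*alpha^l) = 1
    if lambda(l) > eps
        A(:, l) = A(:, l) / sqrt(lambda(l));
    end
end
d = 2;                                   % number of kernel principal components to keep
Y = Kc * A(:, 1:d);                      % projections of the training points onto w^1, ..., w^d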
Examples of Kernels
[Figures: the feature mapping f illustrated for a polynomial kernel (n=2) and an RBF kernel (n=2)]
4. Others
• 2DPCA
– Prof. Yang Jingyu et al., Nanjing University of Science and Technology, IEEE T-PAMI, 2004(1)
– 2DPCA extracts features at least as well as PCA, but requires more memory than PCA.
• 2DLDA
– Prof. Yuan Baozong et al., Beijing Jiaotong University, P. R. Letters, 2005(3)
• Kernel ECA (KECA)
– Robert Jenssen et al., IEEE T-PAMI, 2010(5)
– Maximum entropy preservation: the transformation loses as little entropy as possible, cleverly combining entropy with the kernel-induced data mapping; the entropy computation reduces naturally to a computation on the kernel matrix, so the problem becomes an optimization in kernel space.
References
[1] J. T. Y. Kwok and I. W. H. Tsang, "The Pre-Image Problem in Kernel Methods,"
IEEE Transactions on Neural Networks, vol. 15, pp. 1517-1525, 2004.
[2] S. Mika, et al., "Kernel PCA and De-Noising in Feature Spaces," in Proceedings of
the 1998 conference on Advances in Neural Information Processing Systems II,
1999.
[3] B. Schölkopf, et al., "Nonlinear Component Analysis as a Kernel Eigenvalue
Problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[4] R. Jenssen, "Information Theoretic Learning and Kernel Methods," in Information
Theory and Statistical Learning, Springer US, 2009, pp. 209-230.
[5] R. Jenssen, et al., "Kernel Maximum Entropy Data Transformation and an Enhanced
Spectral Clustering Algorithm," in Proceedings of the 2006 conference on Advances
in Neural Information Processing Systems 19, 2007, pp. 633-640.
[6] R. Jenssen and O. Storås, "Kernel Entropy Component Analysis Pre-images for
Pattern Denoising," in Proceedings of the 16th Scandinavian Conference on Image
Analysis, Oslo, Norway, 2009, pp. 626-635.
[7] R. Jenssen, "Kernel Entropy Component Analysis," IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 32, pp. 847-860, 2010.