PCA and Kernel PCA
Presented by Shicai Yang
Institute of Systems Engineering, Southeast University
April 13, 2015

Outline
• PCA
• Kernel Methods
• Kernel PCA
• Others

1. PCA Overview
• Principal component analysis (PCA) is a way to reduce data dimensionality.
• PCA projects high-dimensional data onto a lower-dimensional subspace.
• PCA projects the data in the least-squares sense: it captures the large (principal) variability in the data and ignores the small variability.

PCA: An Intuitive Approach
Let us say we have data points $x_i$, $i = 1, \dots, N$, in $p$ dimensions ($p$ is large). If we want to represent the data set by a single point $x_0$, the natural choice is the sample mean
$$x_0 = m = \frac{1}{N}\sum_{i=1}^{N} x_i.$$
Can we justify this choice mathematically? Consider the cost
$$J_0(x_0) = \sum_{i=1}^{N}\|x_i - x_0\|^2.$$
It turns out that if you minimize $J_0$, you get the above solution, namely the sample mean.

PCA: An Intuitive Approach…
Representing the data set $x_i$, $i = 1, \dots, N$, by its mean alone is quite uninformative, so let us try to represent the data by a straight line of the form
$$x = m + a e.$$
This is the equation of a straight line passing through $m$; $e$ is a unit vector along the line, and $a$ is the signed distance of a point $x$ from $m$. The training points projected onto this line are
$$x_i = m + a_i e, \quad i = 1, \dots, N.$$

PCA: An Intuitive Approach…
Let us now determine the $a_i$'s by minimizing
$$J_1(a_1, \dots, a_N, e) = \sum_{i=1}^{N}\|m + a_i e - x_i\|^2
= \sum_{i=1}^{N} a_i^2 \|e\|^2 - 2\sum_{i=1}^{N} a_i\, e^T (x_i - m) + \sum_{i=1}^{N}\|x_i - m\|^2.$$
Partially differentiating with respect to $a_i$, we get
$$a_i = e^T (x_i - m).$$
Plugging this expression for $a_i$ into $J_1$, we get
$$J_1(e) = -\,e^T S e + \sum_{i=1}^{N}\|x_i - m\|^2,
\qquad \text{where } S = \sum_{i=1}^{N}(x_i - m)(x_i - m)^T$$
is called the scatter matrix.

PCA: An Intuitive Approach…
So minimizing $J_1$ is equivalent to maximizing $e^T S e$ subject to the constraint that $e$ is a unit vector: $e^T e = 1$. Use the Lagrange multiplier method to form the objective function
$$e^T S e - \lambda (e^T e - 1).$$
Differentiating, we obtain the equation
$$2 S e - 2\lambda e = 0, \qquad \text{or} \qquad S e = \lambda e.$$
The solution is that $e$ is the eigenvector of $S$ corresponding to the largest eigenvalue.

PCA: An Intuitive Approach…
The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a $d$-dimensional plane of the form
$$x = m + a_1 e_1 + \dots + a_d e_d,$$
where $d$ is much smaller than the original dimension $p$. In this case one can form the objective function
$$J_d = \sum_{i=1}^{N}\Big\|\Big(m + \sum_{k=1}^{d} a_{ik} e_k\Big) - x_i\Big\|^2.$$
It can also be shown that the vectors $e_1, e_2, \dots, e_d$ are the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of the scatter matrix
$$S = \sum_{i=1}^{N}(x_i - m)(x_i - m)^T.$$

PCA: Visually
Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are given by the eigenvectors of the scatter matrix.
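The derivation above reduces PCA to an eigen-decomposition of the scatter matrix S. The following minimal MATLAB sketch (not part of the original slides) illustrates that recipe; the toy data, the variable names (X, m, S, E, A) and the choice d = 2 are illustrative assumptions.

% Minimal PCA sketch: project N p-dimensional samples (rows of X) onto the
% d eigenvectors of the scatter matrix S with the largest eigenvalues.
X = randn(100, 5) * diag([5 3 1 0.5 0.1]);   % toy data, N = 100, p = 5
d = 2;                                       % target dimension (d << p)
m = mean(X, 1);                              % sample mean (the minimizer of J0)
Xc = X - repmat(m, size(X, 1), 1);           % centered data x_i - m
S = Xc' * Xc;                                % scatter matrix S = sum (x_i - m)(x_i - m)'
[E, L] = eig(S);                             % eigenvectors and eigenvalues of S
[~, idx] = sort(diag(L), 'descend');         % order by decreasing eigenvalue
E = E(:, idx(1:d));                          % keep the d leading eigenvectors e_1 ... e_d
A = Xc * E;                                  % coefficients a_ik = e_k' (x_i - m)
Xhat = repmat(m, size(X, 1), 1) + A * E';    % reconstruction m + sum_k a_ik e_k

Projecting onto E and reconstructing as m + A*E' gives exactly the least-squares approximation that J_d measures.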
PCA Steps
Let $x = (x_1, x_2, \dots, x_n)^T$ be an $n$-dimensional random vector.
(1) Arrange the raw observations into an observation matrix X, with each column an observation sample and each row a dimension (variable).
(2) Compute the covariance matrix covX of the samples (note that MATLAB's COV treats rows as observations, so pass X' if the samples are stored in columns).
(3) Compute the eigenvalues and eigenvectors of covX and sort the eigenvalues in descending order.
(4) Take the eigenvectors corresponding to the m largest eigenvalues to form the matrix V.
(5) Y = V^T X is then the dimension-reduced data matrix.

MATLAB Functions and Algorithms for PCA
1. princomp: principal component analysis
• PC = princomp(X)
• [PC, score, latent, tsquare] = princomp(X)
  – Performs principal component analysis on the data matrix X (N×p; rows are observation samples, columns are feature variables) and returns the principal components (PC), the so-called Z-scores (score), the eigenvalues of the covariance matrix of X (latent), and Hotelling's T^2 statistic for each data point (tsquare).
2. pcacov: principal component analysis from a covariance matrix
• PC = pcacov(X)
• [PC, latent, explained] = pcacov(X)
  – Performs principal component analysis on the covariance matrix X and returns the principal components (PC), the eigenvalues of the covariance matrix X (latent), and the percentage of the total observed variance explained by each eigenvector (explained).

3. pcares: residuals from principal component analysis
• residuals = pcares(X, ndim)
  – Returns the residuals obtained by retaining ndim principal components of X. Note that ndim is a scalar and must be smaller than the number of columns of X, and that X is a data matrix, not a covariance matrix.
4. barttest: Bartlett's test for the principal components
• ndim = barttest(X, alpha)
• [ndim, prob, chisquare] = barttest(X, alpha)
  – Bartlett's test is a test of equal variances. ndim = barttest(X, alpha) gives, at significance level alpha, the dimension of the non-random model that fits the data matrix X; ndim is determined by a sequence of hypothesis tests. ndim = 1 indicates that the variances of X along every principal component are equal; ndim = 2 indicates that the variances along the second and all remaining components are equal.

Computing the Covariance
(1) XCOV = COV(X)
(2) An explicit double loop (rows are observation samples, columns are feature variables; cv is the returned covariance matrix):
xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xx1 = x(:,i);
    mxx1 = xmean(i);
    for j = 1:xsize(2)
        xx2 = x(:,j);
        mxx2 = xmean(j);
        v = ((xx1-mxx1)'*(xx2-mxx2))/(xsize(1)-1);  % divide by N-1, where N is the number of samples (rows), not the number of variables
        cv(i,j) = v;
        cv(j,i) = v;
    end
end

A MATLAB Implementation of PCA
function [xeigvsort, xeigdsort, final] = KL_Exp(x)
xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xadjust(:,i) = x(:,i) - xmean(i);   % subtract the mean of each variable
end
xcov = cov(xadjust);                    % covariance matrix
[xeigv, xeigd] = eig(xcov);             % eigenvectors and eigenvalues
xeigvsort = fliplr(xeigv);              % eigenvectors sorted by decreasing eigenvalue
xeigdsort = flipud(fliplr(xeigd));      % eigenvalues sorted in descending order
finaleigs = xeigvsort(:, 1:xsize(2));   % choose the transformation basis; reduce xsize(2) to keep fewer components
pdata = finaleigs' * xadjust';          % apply the transformation
final = pdata';

Assumptions and Limitations
• Linearity
  – PCA's internal model is linear, so the relationships it can extract between principal components are also linear. The currently popular Kernel PCA family of methods is a nonlinear extension of ordinary PCA.
• Mean and variance as sufficient statistics
  – Distributions that are fully described by their mean and variance are essentially limited to the exponential family (in particular the Gaussian). If the data distribution is non-Gaussian, PCA loses its effectiveness and ICA methods come into play.
• Large variance means large importance
  – PCA implicitly assumes that the data have a high signal-to-noise ratio, so the direction with the largest variance can be regarded as a principal component, while directions with small variance are regarded as noise. This amounts to choosing a low-pass filter.
• Orthogonal principal components
  – PCA assumes that the principal component vectors are mutually orthogonal, so a whole set of efficient linear-algebra tools can be applied, which greatly improves efficiency and widens the range of applications.

2. Kernel Methods
• Find a mapping f such that, in the new space, problem solving is easier (e.g. linear).
• The kernel represents the similarity between two objects, defined as the dot product in this new vector space.
• But the mapping is left implicit.
• This gives an easy generalization of many dot-product (or distance) based pattern recognition algorithms.

Kernel Methods: the Mapping f
[Figure: the mapping f takes points from the original space to the feature (vector) space.]

Feature Spaces
$$f : x \mapsto f(x), \qquad \mathbb{R}^d \to F$$
Non-linear mapping to F:
1. A high-dimensional space
2. An infinite-dimensional countable space: $L_2$
3. A function space (Hilbert space)
Example: $f(x, y) = (x^2,\; y^2,\; \sqrt{2}\,xy)$

Kernel: a More Formal Definition
• A kernel k(x, y)
  – is a similarity measure
  – defined by an implicit mapping f from the original space to a vector space (the feature space)
  – such that $k(x, y) = f(x) \cdot f(y)$.
• This similarity measure and the mapping can include:
  – Invariance or other a priori knowledge
  – Simpler structure (a linear representation of the data)
  – The class of functions the solution is taken from
  – A possibly infinite-dimensional hypothesis space for learning
  – … but still computational efficiency when computing k(x, y)
These are the general principles governing kernel design.
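To make the definition k(x, y) = f(x)·f(y) concrete, here is a small MATLAB check (not from the slides) that the explicit degree-2 map from the Feature Spaces example reproduces the squared dot product as a kernel. The names phi and kfun and the test vectors are illustrative; phi plays the role of the mapping f above.

% Toy check that the kernel equals a dot product in feature space:
% f(x) = (x1^2, x2^2, sqrt(2)*x1*x2) and k(x,y) = <x,y>^2.
phi = @(v) [v(1)^2; v(2)^2; sqrt(2)*v(1)*v(2)];   % explicit feature map
kfun = @(a, b) (a' * b)^2;                        % degree-2 polynomial kernel
x = [1; 2];
y = [3; -1];
lhs = kfun(x, y);          % kernel evaluated in the original 2-D space
rhs = phi(x)' * phi(y);    % dot product in the 3-D feature space
fprintf('kernel = %.4f, feature-space dot product = %.4f\n', lhs, rhs);   % both print 1.0000

Both quantities are equal, which is what lets kernel methods avoid ever forming f(x) when the feature space is large.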
Kernel Trick
Note: in the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace
$$x \to f(x), \qquad G_{ij} = \langle x_i, x_j \rangle \;\to\; G_{ij} = \langle f(x_i), f(x_j) \rangle = K(x_i, x_j).$$
If we use algorithms that only depend on the Gram matrix G, then we never have to know (or compute) the actual features. This is the crucial point of kernel methods.

Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: any kernel can be used with any kernel algorithm.
Some kernels:
• $k(x, y) = e^{-\|x - y\|^2 / c}$
• $k(x, y) = \langle x, y \rangle^d$
• $k(x, y) = \tanh(\langle x, y \rangle)$
• $k(x, y) = \dfrac{1}{\|x - y\|^2 + c^2}$
Some kernel algorithms:
• SVM
• Fisher LDA (KFDA)
• Kernel regression
• Kernel PCA
• Kernel CCA

Benefits from Kernels
• Generalizes (nonlinearly) pattern recognition algorithms in clustering, classification, density estimation, …
  – When these algorithms are dot-product based, by replacing the dot product $x \cdot y$ by $k(x, y) = f(x) \cdot f(y)$, e.g. linear discriminant analysis, logistic regression, perceptron, SOM, PCA, ICA, … NB: this often implies working with the "dual" form of the algorithm.
  – When these algorithms are distance-based, by replacing the squared distance $d(x, y)^2$ by $k(x, x) + k(y, y) - 2k(x, y)$, the squared distance in feature space.
• Freedom in choosing f implies a large variety of learning algorithms.

3. Kernel PCA
• The assumption behind PCA is that the data points x are multivariate Gaussian.
• Often this assumption does not hold.
• However, it may still be possible that a transformation f(x) is Gaussian; then we can perform PCA in the space of f(x).
• Kernel PCA performs exactly this PCA; however, because of the "kernel trick," it never computes the mapping f(x) explicitly!

KPCA: Basic Idea
[Figure: the basic idea of KPCA.]

Kernel PCA Formulation
We need the following fact. Let v be an eigenvector of the scatter matrix
$$S = \sum_{i=1}^{N} x_i x_i^T.$$
Then v belongs to the linear space spanned by the data points $x_i$, $i = 1, 2, \dots, N$.
Proof: $S v = \lambda v$ implies
$$v = \frac{1}{\lambda} \sum_{i=1}^{N} x_i (x_i^T v) = \sum_{i=1}^{N} \alpha_i x_i.$$

Kernel PCA Formulation…
Let C be the scatter matrix of the centered mapping f(x):
$$C = \sum_{i=1}^{N} f(x_i) f(x_i)^T.$$
Let w be an eigenvector of C; then w can be written as a linear combination
$$w = \sum_{k=1}^{N} \alpha_k f(x_k).$$
Also, we have $C w = \lambda w$. Combining, we get
$$\Big(\sum_{i=1}^{N} f(x_i) f(x_i)^T\Big)\Big(\sum_{k=1}^{N} \alpha_k f(x_k)\Big) = \lambda \sum_{k=1}^{N} \alpha_k f(x_k).$$

Kernel PCA Formulation…
$$\sum_{i=1}^{N}\sum_{k=1}^{N} f(x_i)\, f(x_i)^T f(x_k)\, \alpha_k = \lambda \sum_{k=1}^{N} \alpha_k f(x_k)$$
Taking the dot product with $f(x_l)$ on both sides gives
$$\sum_{i=1}^{N}\sum_{k=1}^{N} f(x_l)^T f(x_i)\, f(x_i)^T f(x_k)\, \alpha_k = \lambda \sum_{k=1}^{N} \alpha_k\, f(x_l)^T f(x_k), \qquad l = 1, 2, \dots, N.$$
In matrix form,
$$K^2 \alpha = \lambda K \alpha \;\Rightarrow\; K \alpha = \lambda \alpha, \qquad \text{where } K_{ij} = f(x_i)^T f(x_j)$$
is the kernel or Gram matrix.

Kernel PCA Formulation…
From the eigen equation $K\alpha = \lambda\alpha$ and the fact that the eigenvector w is normalized to 1, we obtain
$$\|w\|^2 = \Big(\sum_{i=1}^{N} \alpha_i f(x_i)\Big)^T \Big(\sum_{i=1}^{N} \alpha_i f(x_i)\Big) = \alpha^T K \alpha = \lambda\, \alpha^T \alpha = 1,
\qquad \text{so} \qquad \alpha^T \alpha = \frac{1}{\lambda}.$$

KPCA Algorithm
Step 1: Compute the Gram matrix: $K_{ij} = k(x_i, x_j)$, $i, j = 1, \dots, N$.
Step 2: Compute the (eigenvector, eigenvalue) pairs of K: $(\alpha^l, \lambda_l)$, $l = 1, \dots, M$.
Step 3: Normalize the eigenvectors: $\alpha^l \leftarrow \alpha^l / \sqrt{\lambda_l}$.
Thus, an eigenvector $w_l$ of C is now represented as
$$w_l = \sum_{k=1}^{N} \alpha_k^l f(x_k).$$
To project a test feature f(x) onto $w_l$ we need to compute
$$f(x)^T w_l = f(x)^T \sum_{k=1}^{N} \alpha_k^l f(x_k) = \sum_{k=1}^{N} \alpha_k^l\, k(x_k, x).$$
So we never need f explicitly.
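The three steps above translate almost directly into MATLAB. The sketch below is a minimal illustration, not part of the slides: the RBF kernel, the width sigma, the toy data and the variable names are illustrative choices, pdist2 comes from the Statistics Toolbox that the earlier slides already use, and the double-centering of K (which realizes the "centered mapping" assumed in the formulation) follows the recipe of Schölkopf et al. [3].

% Kernel PCA sketch with an RBF kernel, following Steps 1-3 above.
% X: N-by-p training data (rows are samples); Xt: test points to project.
X = randn(50, 3);
Xt = randn(5, 3);
sigma = 1.0;                                  % kernel width (illustrative)
d = 2;                                        % number of kernel principal components
rbf = @(A, B) exp(-pdist2(A, B).^2 / (2*sigma^2));   % k(x,y) = exp(-||x-y||^2/(2*sigma^2))

% Step 1: Gram matrix K_ij = k(x_i, x_j), then center the features implicitly
K = rbf(X, X);
N = size(X, 1);
J = ones(N, N) / N;
Kc = K - J*K - K*J + J*K*J;                   % double-centered Gram matrix

% Step 2: eigen-decomposition of the centered Gram matrix
[Alpha, L] = eig((Kc + Kc') / 2);             % symmetrize for numerical safety
[lambda, idx] = sort(diag(L), 'descend');
Alpha = Alpha(:, idx(1:d));
lambda = lambda(1:d);

% Step 3: rescale so that alpha' * alpha = 1/lambda, i.e. ||w_l|| = 1
Alpha = Alpha ./ repmat(sqrt(lambda(:))', N, 1);

% Projection of test points: f(x)' * w_l = sum_k alpha_k^l * k(x_k, x)
Kt = rbf(Xt, X);
Jt = ones(size(Xt, 1), N) / N;
Ktc = Kt - Jt*K - Kt*J + Jt*K*J;              % center the test kernel consistently
Y = Ktc * Alpha;                              % rows of Y are the kernel principal components

Only the rbf handle would change for a different kernel; everything downstream depends on K alone, which is exactly the modularity described earlier.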
Examples of Kernels
[Figure: examples with a polynomial kernel (n = 2) and an RBF kernel (n = 2).]

4. Others
• 2DPCA
  – Prof. Yang Jingyu et al., Nanjing University of Science and Technology, IEEE T-PAMI, 2004(1)
  – The feature extraction performance of 2DPCA is at least as good as that of PCA, but it requires more memory than PCA.
• 2DLDA
  – Prof. Yuan Baozong et al., Beijing Jiaotong University, Pattern Recognition Letters, 2005(3)
• Kernel ECA (KECA)
  – Robert Jenssen et al., IEEE T-PAMI, 2010(5)
  – KECA preserves the maximum amount of entropy, i.e. it loses as little entropy as possible. It cleverly combines the entropy estimate with the kernel data mapping, so that computing the entropy reduces naturally to computing the kernel matrix, and the problem becomes an optimization in the kernel-induced space.

References
[1] J. T. Y. Kwok and I. W. H. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Transactions on Neural Networks, vol. 15, pp. 1517-1525, 2004.
[2] S. Mika, et al., "Kernel PCA and De-Noising in Feature Spaces," in Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, 1999.
[3] B. Schölkopf, et al., "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[4] R. Jenssen, "Information Theoretic Learning and Kernel Methods," in Information Theory and Statistical Learning, Springer US, 2009, pp. 209-230.
[5] R. Jenssen, et al., "Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm," in Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems 19, 2007, pp. 633-640.
[6] R. Jenssen and O. Storås, "Kernel Entropy Component Analysis Pre-images for Pattern Denoising," in Proceedings of the 16th Scandinavian Conference on Image Analysis, Oslo, Norway, 2009, pp. 626-635.
[7] R. Jenssen, "Kernel Entropy Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 847-860, 2010.